very curious about the design principles

fingltd / 4mc

4mc - splittable lz4 and zstd in hadoop/spark/flink

very curious about the design principles #11

Closed by DazhuangSu 9 years ago

DazhuangSu commented 9 years ago

I'm using the lzo format in Hadoop, and I need to create and read a separate index file to make the data splittable, which can cause some bizarre problems. I have looked at the lz4 and lzo formats and found them very similar, so I don't have a clue how you made lz4 splittable. Could you briefly explain how you did it? Thank you.

carlomedas commented 9 years ago

Quite easy: have a look at the 4mc format spec here https://github.com/carlomedas/4mc/blob/master/4mc-format-spec

As you can see, there is a footer containing the indexes. So a 4mc file carries its full index inside the same file, and Hadoop jobs leverage it automatically.
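To make the footer idea concrete, here is a toy sketch of a container that stores independently compressed blocks followed by an index footer. It is not the real 4mc layout (see the spec linked above for that): the `TOY1` signature, the field widths, and the function names are all assumptions made for illustration, and `zlib` stands in for LZ4 so the example runs with only the Python standard library.

```python
import io
import struct
import zlib

MAGIC = b"TOY1"  # hypothetical signature, not the real 4mc magic


def write_container(out, data, block_size=64 * 1024):
    """Write independently compressed blocks followed by an index footer."""
    out.write(MAGIC)
    offsets = []
    for start in range(0, len(data), block_size):
        chunk = data[start:start + block_size]
        compressed = zlib.compress(chunk)  # zlib stands in for LZ4 here
        offsets.append(out.tell())
        # Each block records its compressed length so it can be read on its own.
        out.write(struct.pack(">I", len(compressed)))
        out.write(compressed)
    # Footer: one absolute offset per block, then the block count as the last 4 bytes.
    for off in offsets:
        out.write(struct.pack(">Q", off))
    out.write(struct.pack(">I", len(offsets)))


def read_index(f):
    """Recover the block offsets by reading the footer from the end of the file."""
    f.seek(-4, io.SEEK_END)
    (count,) = struct.unpack(">I", f.read(4))
    f.seek(-4 - 8 * count, io.SEEK_END)
    return [struct.unpack(">Q", f.read(8))[0] for _ in range(count)]


def read_block(f, offset):
    """Decompress the single block that starts at the given offset."""
    f.seek(offset)
    (size,) = struct.unpack(">I", f.read(4))
    return zlib.decompress(f.read(size))
```

Because every block is compressed on its own and the footer records where each one starts, a reader can jump straight to any block without decompressing what comes before it. That property is what makes the file splittable.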

DazhuangSu commented 9 years ago

So 4mc is composed of a signature, a file header, compressed blocks, and an index footer, while lz4 is composed of sequences. Does one compressed block represent a single lz4 sequence or a whole sequence list? If one compressed block represents a whole sequence list (a complete lz4 file), then I think the footer is similar to the lzo index file, and you created a new format, 4mc, so that the index can be stored in the same file as the lz4 data. Do I understand your idea correctly?

carlomedas commented 9 years ago

Yes, exactly. You got it right.
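As a rough illustration of how an index like this, whether embedded in a footer or kept in a separate lzo-style file, lets a job divide the work: one common rule is that a byte-range split owns every block whose first byte falls inside the split. The sketch below assumes that rule and a sorted list of block offsets; the function name and the sample numbers are hypothetical, and this is not taken from the 4mc source.

```python
from bisect import bisect_left


def blocks_for_split(block_offsets, split_start, split_end):
    """Return the offsets of blocks that begin inside [split_start, split_end).

    block_offsets must be sorted in ascending order, as read from the index.
    """
    lo = bisect_left(block_offsets, split_start)
    hi = bisect_left(block_offsets, split_end)
    return block_offsets[lo:hi]


# Hypothetical index: four blocks in a 5000-byte file, cut into two splits.
offsets = [4, 1000, 2100, 3500]
first = blocks_for_split(offsets, 0, 2000)
second = blocks_for_split(offsets, 2000, 5000)
```

With this rule every block is processed by exactly one split, even when a block's compressed bytes run past the end of the split in which it starts.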

DazhuangSu commented 9 years ago

Okay. Thank you very much. That's better than an external index file: supporting the new format only requires some configuration.

carlomedas commented 9 years ago

Yes, you can then use it in Hadoop as an input and output format, and also outside Hadoop to open files and process them.