Closed DazhuangSu closed 9 years ago
Quite easy: have a look at the 4mc format spec here https://github.com/carlomedas/4mc/blob/master/4mc-format-spec
As you can see, there is a footer containing the indexes. So in 4mc files you have the full index inside the same file, and hadoop jobs will leverage it automatically when they run.
So 4mc is composed of a signature, a file header, compressed blocks and an index footer, while lz4 is composed of sequences. Does one compressed block represent a single sequence of an lz4 file, or a whole sequence list? If one compressed block represents a whole sequence list (a complete lz4 file), then I think the footer is similar to the lzo index file, and you created a new format, 4mc, so that you can put the index into the lz4 file itself. Do I understand your idea correctly?
Yes, exactly, you got it right.
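To make the idea above concrete, here is a minimal sketch of a container with independently compressed blocks and an index footer. It is a toy, not the real 4mc layout: the `TOY1` signature, the field sizes, and the function names are all made up for illustration, and `zlib` stands in for LZ4 only so the example runs with the Python standard library. See the format spec linked above for the actual byte layout.

```python
import io
import struct
import zlib

MAGIC = b"TOY1"  # hypothetical signature, NOT the real 4mc magic bytes


def write_container(chunks):
    """Compress each chunk on its own and append a footer index of block offsets."""
    out = io.BytesIO()
    out.write(MAGIC)
    offsets = []
    for chunk in chunks:
        offsets.append(out.tell())          # remember where this block starts
        payload = zlib.compress(chunk)      # stand-in for LZ4 block compression
        out.write(struct.pack(">II", len(chunk), len(payload)))
        out.write(payload)
    # Footer: one 8-byte offset per block, then the block count as the last
    # 4 bytes, so a reader can locate the index starting from the end of file.
    for off in offsets:
        out.write(struct.pack(">Q", off))
    out.write(struct.pack(">I", len(offsets)))
    return out.getvalue()


def read_index(data):
    """Read the footer and return the list of block start offsets."""
    if data[:4] != MAGIC:
        raise ValueError("not a toy container")
    (count,) = struct.unpack(">I", data[-4:])
    start = len(data) - 4 - 8 * count
    return [
        struct.unpack(">Q", data[start + 8 * i : start + 8 * i + 8])[0]
        for i in range(count)
    ]


def read_block(data, offset):
    """Decompress the single block starting at `offset`, ignoring all others."""
    raw_len, comp_len = struct.unpack(">II", data[offset : offset + 8])
    block = zlib.decompress(data[offset + 8 : offset + 8 + comp_len])
    if len(block) != raw_len:
        raise ValueError("corrupt block")
    return block


chunks = [b"alpha\n" * 100, b"beta\n" * 100, b"gamma\n" * 100]
data = write_container(chunks)
index = read_index(data)
# A worker assigned the second split can decode it without reading block 0.
second = read_block(data, index[1])
```

Because every block is compressed independently and its offset is recorded in the footer, a reader can start at any block boundary, which is the property that makes a file splittable without an external index.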
Okay. Thank you very much. It's better than an external index file. I only need to do some configuration to support the new format.
Yes, you can then use it in hadoop as input and output format, and also externally from hadoop to open files and process them.
I'm using the lzo format in Hadoop, and I need to create and read an index file to make it splittable, which may cause some bizarre problems. I have looked at the lz4 and lzo formats and found that they are very similar, but I don't have a clue how you made lz4 splittable. Could you give me a brief explanation of how you did it? Thank you.