fingltd / 4mc

4mc - splittable lz4 and zstd in hadoop/spark/flink

4mc codecs should implement SplittableCompressionCodec #24

Open pradeepg26 opened 7 years ago

pradeepg26 commented 7 years ago

The implementation of Codec and InputFormat seems to follow the pattern from Elephantbird. However, this isn't a good pattern in my opinion. In the spirit of Hadoop, the concepts of compression and file format should be decoupled. We should be able to change compression formats without needing to change the way those files are read.

Currently, if we change the compression from e.g. gz to 4mc, we need to change the InputFormat that is used to read the files, and we couldn't change the compression again without changing the InputFormat once more. To do this gracefully, we would need to code defensively and dynamically switch InputFormats based on what files are in the input location. I don't think this strategy would work for a directory containing files compressed with different formats.

In order to support this type of flexibility, the 4mc codecs should implement the SplittableCompressionCodec interface. This provides existing formats the ability to gracefully handle the new compression formats.
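For illustration, the decoupling described above amounts to the reader asking the codec, not the input format, whether a file can be split. Hadoop's stock text input format makes this decision by looking up the codec from the file name and checking whether it implements SplittableCompressionCodec. The sketch below models that rule in plain Java without any Hadoop dependency; `Codec`, `SplittableCodec`, `CodecRegistry`, `GzipLike`, and `FourMcLike` are hypothetical stand-ins, not Hadoop or 4mc classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for Hadoop's CompressionCodec and
// SplittableCompressionCodec; only the shape of the idea is modeled.
interface Codec {
    String extension();
}

// Marker-style sub-interface: codecs that can start reading mid-file.
interface SplittableCodec extends Codec {
}

final class GzipLike implements Codec {
    public String extension() { return ".gz"; }
}

final class FourMcLike implements SplittableCodec {
    public String extension() { return ".4mc"; }
}

final class CodecRegistry {
    private final Map<String, Codec> byExtension = new LinkedHashMap<>();

    void register(Codec codec) {
        byExtension.put(codec.extension(), codec);
    }

    // Returns the codec whose extension matches the file name,
    // or null when the file is treated as uncompressed.
    Codec codecFor(String fileName) {
        for (Map.Entry<String, Codec> e : byExtension.entrySet()) {
            if (fileName.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    // One generic rule for every input format: uncompressed files and
    // files with a splittable codec may be split; everything else may not.
    boolean isSplittable(String fileName) {
        Codec codec = codecFor(fileName);
        return codec == null || codec instanceof SplittableCodec;
    }
}
```

With this shape, a directory that mixes `.gz`, `.4mc`, and uncompressed files can be read by one input format, because the per-file decision depends only on the codec that matches each file.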

carlomedas commented 7 years ago

Hello there.

Is this a new interface coming with a new hadoop version or something like that?

pradeepg26 commented 7 years ago

Nope, it's been around for a while. Take a look at BZip2Codec for an example of how it's intended to be used.

carlomedas commented 7 years ago

You say you would like to change the compression algo inside 4mc, but that's currently not supported. As a matter of fact, to provide both lz4 and zstd I created both 4mc and 4mz, one dedicated to each of them. The good news is that a splittable compression format is now being discussed in zstandard itself, so it's going to be available at the source very soon.

pradeepg26 commented 7 years ago

Great to hear that zstd is working on a splittable compression format. I'll probably just wait for that.

In the meantime, I'm not proposing to change the compression algo inside 4mc, just a refactor of the code to move where the splits are adjusted. Currently the splits are adjusted in the getSplits method of FourMcInputFormat and FourMzInputFormat. If we adjusted the split boundaries inside the SplitCompressionInputStream instead, we wouldn't need the specialized input formats.
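As a rough illustration of the idea (not the actual patch, and not 4mc's code), moving the adjustment into the stream means the stream takes the requested byte range and snaps both ends forward to compressed-block boundaries, so that each block is read by exactly one split. The sketch below assumes the block start offsets are already known as a sorted array; `SplitAligner`, `nextBoundary`, `align`, and `blockOffsets` are hypothetical names introduced here.

```java
import java.util.Arrays;

final class SplitAligner {

    // Smallest block start offset that is >= pos, or fileLength if
    // pos lies past the start of the last block.
    // blockOffsets must be sorted in ascending order.
    static long nextBoundary(long[] blockOffsets, long fileLength, long pos) {
        int i = Arrays.binarySearch(blockOffsets, pos);
        if (i < 0) {
            i = -i - 1; // binarySearch encodes the insertion point as -(point) - 1
        }
        return i < blockOffsets.length ? blockOffsets[i] : fileLength;
    }

    // Snaps a requested split [start, end) to block boundaries.
    // A block belongs to the split that contains its first byte, so both
    // ends move forward to the next block start. The result is
    // {adjustedStart, adjustedEnd}; equal values mean the split is empty.
    static long[] align(long[] blockOffsets, long fileLength, long start, long end) {
        long adjustedStart = nextBoundary(blockOffsets, fileLength, start);
        long adjustedEnd = nextBoundary(blockOffsets, fileLength, end);
        return new long[] { adjustedStart, adjustedEnd };
    }
}
```

For example, with blocks starting at 0, 100, 250, and 400 in a 520-byte file, the requested splits [0, 200), [200, 400), and [400, 520) become [0, 250), [250, 400), and [400, 520). A generic input format can then cut the file at arbitrary offsets and leave the correction to the stream.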

I'm working on a patch to implement this, should be out soon.

carlomedas commented 7 years ago

OK, perfect. Let me know.