fingltd / 4mc

4mc - splittable lz4 and zstd in hadoop/spark/flink

merge to hadoop? #4

Closed Tagar closed 8 years ago

Tagar commented 9 years ago

I'm surprised it's not yet part of the Apache Hadoop project :) LZO is a pain to index. Plus it has some licensing issues. Great project.

carlomedas commented 9 years ago

Thanks for the good feedback. Hadoop 2.x ships an LZ4 codec by default, but it cannot be configured for a desired compression ratio and it does not actually provide any splittability. I would be happy to see this as a patch to Hadoop 2.x, but so far I have not even been able to get the attention of the ElephantBird guys to work on integrating 4mc into EB to replace LZO.

Tagar commented 9 years ago

I just emailed Cloudera folks to have a look and file a JIRA ticket to integrate it in. Hopefully this will get integrated. Thanks a lot!

carlomedas commented 9 years ago

Thanks!

svravitej commented 9 years ago

Please let us know when it is integrated.

Waiting for integration with Hadoop.

ianoc commented 9 years ago

EB as in elephantbird from twitter? Do you have a PR/issue to add support?

(Replacing isn't really an option for something like a serialization library, since people have TBs/PBs of data written with existing formats.)

carlomedas commented 9 years ago

Yes, sorry, 'replacing' is wrong here; 'add support' makes much more sense. I got in touch with some EB devs but never got positive feedback on the idea of integration, so I never opened a PR/issue on EB about it.

ianoc commented 9 years ago

I think we'd be fine with the integration, though we at Twitter aren't super likely to use it. I'd like to try it out, but will probably do that outside EB. We have discussed getting off those container formats in EB, so if we were to migrate, it would more likely be to something sequence-file based for ourselves (which handles splitting regardless of compression). But I plan on trying out the extra options and such from 4mc to see how they perform for our existing LZ4 use cases now.

carlomedas commented 9 years ago

Very good, let me know what you think and how you find it. I also agree with your approach: using a protobuf container is not the best option from a performance point of view when you already have a super-packet containing other info. In our tests we saw a small performance degradation when moving from our data blocks (compressed with LZ4 anyway) to EB/4mc (also only inside native C++ code). Of course, it was more than acceptable given the scalability we have in the Hadoop/EB architecture and, most of all, given that the EB framework is already coded and bug-free :)

svravitej commented 8 years ago

Hi,

I think I am not in any way connected to this mail. Please remove me from the notifications.

Regards, Ravitej

On Mon, Jul 25, 2016 at 5:53 AM, Carlo Medas notifications@github.com wrote:

Closed #4 https://github.com/carlomedas/4mc/issues/4.