fullcontact / hadoop-sstable

Splittable Input Format for Reading Cassandra SSTables Directly
Apache License 2.0

WIP: SSTableRowInputFormat impl for the old `mapred` style API #18

Closed eentzel closed 9 years ago

eentzel commented 9 years ago

@bvanberg

If we want to make a Scalding source out of SSTableInputFormat, I think this is the start of what we'd need. It's basically copy-n-paste from SSTableInputFormat, extending org.apache.hadoop.mapred.FileInputFormat instead of org.apache.hadoop.mapreduce.FileInputFormat.

The thing I'm hung up on now is a RecordReader implementation to go with it: the old and new interfaces are just different enough that I'm not quite sure how to translate the existing implementation.
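One possible way to bridge the two is an adapter rather than a translation. The sketch below is only an illustration under stated assumptions: `NewStyleReader` and `OldStyleReader` are simplified stand-ins that mirror the method shapes of `org.apache.hadoop.mapreduce.RecordReader` and `org.apache.hadoop.mapred.RecordReader` (minus `initialize(...)` and the checked exceptions), so it compiles without Hadoop on the classpath. `OldApiAdapter`, `ListReader`, and `Demo` are hypothetical names, not part of this repository. The main mismatch it shows is ownership: the new API hands out its current key and value, while the old API expects the reader to fill objects the caller allocated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Supplier;

// Simplified stand-in for org.apache.hadoop.mapreduce.RecordReader (new API):
// the reader owns the current key/value and hands them out on request.
// The real class also has initialize(...) and declares checked exceptions.
interface NewStyleReader<K, V> {
    boolean nextKeyValue();
    K getCurrentKey();
    V getCurrentValue();
    float getProgress();
    void close();
}

// Simplified stand-in for org.apache.hadoop.mapred.RecordReader (old API):
// the caller allocates key/value once and the reader fills them in.
interface OldStyleReader<K, V> {
    K createKey();
    V createValue();
    boolean next(K key, V value);
    long getPos();
    float getProgress();
    void close();
}

// Hypothetical adapter: exposes a new-style reader through the old-style contract.
final class OldApiAdapter<K, V> implements OldStyleReader<K, V> {
    private final NewStyleReader<K, V> delegate;
    private final Supplier<K> keyFactory;
    private final Supplier<V> valueFactory;
    private final BiConsumer<K, K> copyKey;   // (source, target)
    private final BiConsumer<V, V> copyValue; // (source, target)
    private long recordsRead = 0;

    OldApiAdapter(NewStyleReader<K, V> delegate,
                  Supplier<K> keyFactory, Supplier<V> valueFactory,
                  BiConsumer<K, K> copyKey, BiConsumer<V, V> copyValue) {
        this.delegate = delegate;
        this.keyFactory = keyFactory;
        this.valueFactory = valueFactory;
        this.copyKey = copyKey;
        this.copyValue = copyValue;
    }

    @Override public K createKey() { return keyFactory.get(); }
    @Override public V createValue() { return valueFactory.get(); }

    @Override public boolean next(K key, V value) {
        if (!delegate.nextKeyValue()) {
            return false;
        }
        // Copy the delegate's current record into the caller-owned objects.
        copyKey.accept(delegate.getCurrentKey(), key);
        copyValue.accept(delegate.getCurrentValue(), value);
        recordsRead++;
        return true;
    }

    // Assumption: a record count stands in for the byte offset a real reader reports.
    @Override public long getPos() { return recordsRead; }
    @Override public float getProgress() { return delegate.getProgress(); }
    @Override public void close() { delegate.close(); }
}

// Toy new-style reader over in-memory rows, standing in for the SSTable reader.
final class ListReader implements NewStyleReader<StringBuilder, StringBuilder> {
    private final List<String[]> rows;
    private int index = -1;

    ListReader(List<String[]> rows) { this.rows = rows; }

    @Override public boolean nextKeyValue() {
        if (index + 1 >= rows.size()) {
            return false;
        }
        index++;
        return true;
    }
    @Override public StringBuilder getCurrentKey() { return new StringBuilder(rows.get(index)[0]); }
    @Override public StringBuilder getCurrentValue() { return new StringBuilder(rows.get(index)[1]); }
    @Override public float getProgress() {
        return rows.isEmpty() ? 1.0f : (index + 1) / (float) rows.size();
    }
    @Override public void close() { }
}

final class Demo {
    // Drives the adapter the way old-API callers do: allocate once, then loop on next().
    static List<String> run() {
        List<String[]> rows = Arrays.asList(new String[] {"a", "1"}, new String[] {"b", "2"});
        BiConsumer<StringBuilder, StringBuilder> copy = (src, dst) -> {
            dst.setLength(0);
            dst.append(src);
        };
        OldStyleReader<StringBuilder, StringBuilder> reader =
            new OldApiAdapter<StringBuilder, StringBuilder>(
                new ListReader(rows), StringBuilder::new, StringBuilder::new, copy, copy);
        StringBuilder key = reader.createKey();
        StringBuilder value = reader.createValue();
        List<String> out = new ArrayList<>();
        while (reader.next(key, value)) {
            out.add(key + "=" + value);
        }
        reader.close();
        return out;
    }
}
```

Calling `Demo.run()` returns the list `[a=1, b=2]`. In a real port the copy functions would have to be written for the actual key and value types, and `getPos()` would need a meaningful offset, so this only shows the shape of the wrapper, not a drop-in implementation.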

bvanberg commented 9 years ago

Yep, it does look like Cascading STILL only supports the older mapred API. Sounds like a design decision they made long ago that has never been revisited. Unfortunately, SSTableInputFormat wasn't designed to be used with Cascading/Scalding, which is why we now have this impedance mismatch. :disappointed:

I would use this if it works out of the box with little effort. Otherwise I would stick with what we have. It's still pretty fast to get the output you need.