fullcontact / hadoop-sstable

Splittable Input Format for Reading Cassandra SSTables Directly
Apache License 2.0

The ongoing road map #31

Open java8964 opened 8 years ago

java8964 commented 8 years ago

Hi, hadoop-sstable is a great idea for processing C* SSTable files efficiently, but I am starting to think it is a dead end for future C* versions. In our environments, we have lots of datasets stored in C*. I tried forking your code to keep supporting new types and new versions of C*, and here is what I have found so far:

1) C* doesn't have a clean and easy internal API for parsing collection type data out of SSTables in the C* 2.x code base. I have already given up on this path. Instead, I use Spark to load the data from C* into HDFS for small/medium datasets, and force the C* 2.0/2.1 schema to support CDC on our end.

2) C* 2.1 is also causing trouble for us now, because the internal C* API delegates random access to SSTable files to the JDK. This makes random access to SSTable files on HDFS extremely difficult. This may be one of the reasons you cannot support C* 2.1 yet.

I wonder what your opinion is on this. What do you think about supporting C* 2.1 or even 3.0 in hadoop-sstable, especially given all the new types coming in future versions?

Thanks

Yong