fullcontact / hadoop-sstable

Splittable Input Format for Reading Cassandra SSTables Directly
Apache License 2.0

The Indexer job #7

Closed java8964 closed 9 years ago

java8964 commented 9 years ago

This is not really an issue, more of a question. It looks like the index build runs within the driver, multithreaded. It works for my test data. What I want to know is how indexing performs on your production data. Would it make sense to use MR jobs to build the index? Our production system has over 10 TB of SSTable files, and I am worried that running the indexing within a single driver could become the bottleneck in this case. What is your experience?

Thanks

bvanberg commented 9 years ago

Yes, and I'm working on a distributed version of the indexer now. It should be available soon. For reference, we regularly index ~5 TB of data, and it takes on the order of 1-2 hours. That is definitely a bottleneck, and a distributed indexer will resolve it. I'll close this issue when I have that done. If you need it soon, let me know and I can prioritize it. If you'd like to take a crack at it yourself, let me know as well. Pull requests are always welcome.

bvanberg commented 9 years ago

An initial cut of the distributed indexer is available.