fullcontact / hadoop-sstable

Splittable Input Format for Reading Cassandra SSTables Directly
Apache License 2.0

S3 as data input path. #29

Closed abstract-karshit closed 8 years ago

abstract-karshit commented 8 years ago

Hi, I am trying to read data directly from S3 and it fails. It does not report any error or throw an exception; it simply generates a 0 KB index file first. All the JSON part files it generates are also empty (0 KB). I need help reading input (sstables) directly from S3.

bvanberg commented 8 years ago

Hey Karshit, are your sstables from a Priam backup or just raw sstables? Also, which version of C* did the sstables come from? If you have some sample file names, they will tell me whether the sstable versions are supported.

abstract-karshit commented 8 years ago

Hi, the files are supported: when I read them from HDFS and write to S3, it works without any problem and the JSON files are generated with proper data. Only when I try to read them from S3 (and not HDFS) does it show this weird behaviour. The logs look like this:

[Two screenshots of the job logs, taken 2015-12-29 at 8:17:48 PM and 8:18:54 PM]

abstract-karshit commented 8 years ago

They are not a Priam backup, just plain incremental backup files from a Cassandra 2.x cluster.

bvanberg commented 8 years ago

Are you able to extract any errors/stacktrace from the task attempt logs on your cluster?

abstract-karshit commented 8 years ago

Hi Ben, today I automated the entire process, the only extra hop being that I first have to copy the data (sstables) from S3 to HDFS and then process them. I would be more than happy if I could read the data directly from S3, as I do for all my other map-reduce jobs. While reading from S3, I don't see any errors; it's just that all the generated files are 0 bytes, as you can see in the images attached above.