fullcontact / hadoop-sstable

Splittable Input Format for Reading Cassandra SSTables Directly
Apache License 2.0

FileSystem.get() calls should pass a URI in order to be easily portable to other systems #19

Closed: cfstout closed this issue 9 years ago

cfstout commented 9 years ago

I've been porting some of this code to AWS Elastic MapReduce (EMR) and have found a bug in how we look up the file system for our paths. Instead of calling FileSystem.get(job.getConfiguration()), we should pass the optional URI parameter, as in FileSystem.get(inputPath.toUri(), job.getConfiguration()), so that the code is more robust across file systems (local, S3, HDFS, etc.).

If you agree that this is worthwhile, I'm happy to submit a PR with the change.

bvanberg commented 9 years ago

Which versions of AWS EMR are you running on? We run all of our jobs on AWS EMR. Are you reading directly from S3 rather than HDFS?

cfstout commented 9 years ago

Yes, I'm reading directly from S3. Reading from HDFS, which is the default, would avoid the bug. I think I could also configure the job to use S3 as the default file system, but the error message, "You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path.", seems to indicate that the preferred approach is to pass in the URI.
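
To illustrate why the default lookup fails for an S3 path, here is a minimal plain-Java sketch. It does not call Hadoop. It only models, in simplified form, the difference between consulting just the configured default file system (as FileSystem.get(conf) does) and letting the path's own URI scheme decide (as FileSystem.get(uri, conf) does). The class name, method names, namenode address, and bucket name are all hypothetical and chosen for illustration.

```java
import java.net.URI;

// Simplified plain-Java model of file system selection; this is NOT Hadoop code.
final class FsSchemeDemo {

    // Models FileSystem.get(conf): only the configured default file system is consulted,
    // so the scheme of the path being read is never looked at.
    static String schemeFromDefault(URI defaultFs) {
        return defaultFs.getScheme();
    }

    // Models FileSystem.get(uri, conf): the path's own scheme wins, and the
    // default is used only when the path carries no scheme at all.
    static String schemeFromPath(URI path, URI defaultFs) {
        String scheme = path.getScheme();
        return scheme != null ? scheme : defaultFs.getScheme();
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://namenode:8020");        // hypothetical default file system
        URI s3Input = URI.create("s3://example-bucket/sstables");  // hypothetical S3 input path
        URI bareInput = URI.create("/data/sstables");              // path without a scheme

        System.out.println(schemeFromDefault(defaultFs));          // prints "hdfs"
        System.out.println(schemeFromPath(s3Input, defaultFs));    // prints "s3"
        System.out.println(schemeFromPath(bareInput, defaultFs));  // prints "hdfs"
    }
}
```

In this model the S3 path is handed to an HDFS file system whenever only the default is consulted, which is the kind of mismatch the error message describes. Passing the URI keeps paths without a scheme on the default while letting S3 paths resolve to S3.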

bvanberg commented 9 years ago

Cool. Please submit a PR and we'll get it rolled in. Thanks.