fullcontact / hadoop-sstable

Splittable Input Format for Reading Cassandra SSTables Directly
Apache License 2.0

Problems with readIndex(FileSystem, Path) method of SSTableIndexIndex #20

Open cfstout opened 9 years ago

cfstout commented 9 years ago

I have been using this code to build a MapReduce job that runs on AWS Elastic MapReduce (EMR), and there seems to be a bug in the readIndex(final FileSystem fileSystem, final Path sstablePath) method. When we open the index through NativeS3FileSystem, every call to inputStream.available() returns 0. I think the problem lies in the implementation of these InputStream objects, and not necessarily in this repo's code itself. I have managed to work around the issue by moving the read into a while(true) loop and breaking on an EOFException, which, though very hacky, seems to work.
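For anyone hitting the same thing, here is a minimal sketch of the workaround described above. It is not the repo's actual readIndex code: the class and method names (ReadUntilEof, readAllLongs, zeroAvailable) are hypothetical, and it assumes the index is a plain sequence of 8-byte longs. The wrapper stream only simulates a stream whose available() always reports 0; the contract of InputStream.available() allows that, since the value is just an estimate of bytes readable without blocking.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

class ReadUntilEof {

    /**
     * Reads consecutive 8-byte longs until the stream ends.
     * It never consults available(), so it also works on streams
     * that always report 0 bytes available.
     */
    static List<Long> readAllLongs(InputStream in) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (DataInputStream data = new DataInputStream(in)) {
            while (true) {
                offsets.add(data.readLong());
            }
        } catch (EOFException endOfStream) {
            // Expected: readLong() signals end of input this way.
            // Note that a truncated trailing value is silently dropped too.
        }
        return offsets;
    }

    /** Hypothetical stand-in for a stream whose available() is always 0. */
    static InputStream zeroAvailable(byte[] bytes) {
        return new FilterInputStream(new ByteArrayInputStream(bytes)) {
            @Override
            public int available() {
                return 0;
            }
        };
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeLong(0L);
            out.writeLong(1024L);
            out.writeLong(4096L);
        }
        // A loop guarded by available() > 0 would read nothing here.
        System.out.println(readAllLongs(zeroAvailable(buffer.toByteArray())));
    }
}
```

The trade-off is that EOFException becomes part of normal control flow, so a genuinely truncated file looks the same as a clean end of input; comparing the bytes read against the file length reported by the FileSystem would be one way to tell them apart.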

I'm not sure if there is a better solution to the problem, or if it's really an artifact of a bug upstream, but thought I'd mention it here so others are aware.

bvanberg commented 9 years ago

Thanks for pointing this out. When I get a chance I'll look into this on our side.

cfstout commented 9 years ago

Also, another strange issue in this area of code: we have an Index.db file that's 340MB, which is causing an OOM error in this section. We're actually working on the 2.0 WIP branch, so it might be something to consider for that support. Basically, the issue seems to be that the code creates large arrays of Longs that use up heap space. I don't know enough about SSTable particulars to say whether there is any way around this, though.
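One possible way to ease the heap pressure, sketched under the assumption that the offsets are currently held as boxed Long objects: keep them in a growable primitive long[] instead. A boxed Long typically costs an object header plus a reference on top of the 8-byte value, while a primitive array stores only the 8 bytes per entry. PrimitiveLongList below is a hypothetical helper, not a class from this repo or from Cassandra.

```java
import java.util.Arrays;

/** Hypothetical growable list of primitive longs (no boxing). */
class PrimitiveLongList {
    private long[] values = new long[16];
    private int size = 0;

    /** Appends a value, doubling the backing array when it is full. */
    void add(long value) {
        if (size == values.length) {
            values = Arrays.copyOf(values, values.length * 2);
        }
        values[size++] = value;
    }

    int size() {
        return size;
    }

    long get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return values[index];
    }

    /** Returns a copy trimmed to the number of stored values. */
    long[] toArray() {
        return Arrays.copyOf(values, size);
    }

    public static void main(String[] args) {
        PrimitiveLongList offsets = new PrimitiveLongList();
        for (long i = 0; i < 100; i++) {
            offsets.add(i * 1024L);
        }
        System.out.println(offsets.size() + " offsets, last = " + offsets.get(99));
    }
}
```

This only reduces the per-entry overhead. If the index itself is too large to hold in memory, the offsets would have to be processed in a streaming fashion instead, which depends on SSTable details not covered here.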

bvanberg commented 9 years ago

We have a working branch for 2.0.9 internally. The sstable format changed enough from 1.2 that it required us to change how we're parsing and reading the sstables. I'll make sure we have the latest committed here for others to use.

bvanberg commented 9 years ago

Please try the cassandra-2.0.x branch.