fullcontact / hadoop-sstable

Splittable Input Format for Reading Cassandra SSTables Directly
Apache License 2.0

Missing records from the SSTable files #15

Closed java8964 closed 9 years ago

java8964 commented 9 years ago

I am testing with the 0.1.2 release and ran into a problem I cannot pin down.

We have an older implementation that parses one SSTable file per mapper, using the same logic as Cassandra's SSTable exporter. In my test case, taken from one set of production data, there are 375 SSTable files, and the old implementation generates 161,313,791,210 records from them. Using hadoop-sstable, with a new mapper that consumes the (key, SSTableIdentityIterator) pairs it provides, I only got 161,304,497,154 records from the same 375 SSTable files, so 9,294,056 records are missing. The old implementation's total count should be correct, since it was verified using Cassandra's sstable2json tool.

Is it possible that indexing and splitting the SSTable files could lose records? I am going to test one individual SSTable file as a next step, but I wanted to ask first whether you have any suggestions about this case.

Thanks

Xorlev commented 9 years ago

Hi Yong,

No, it's not possible that indexing changed your files. The whole operation is read only. Indexing reads your sstable index files and writes new "index index" files.

The "split" step doesn't change the files, it uses the index index files to find offsets into them and passes that along to the mappers as InputSplits. That being said, it is possible that the code is skipping over data somewhere, so we're very interested to figure out where that's coming from.

Do you have any idea if those 9.2M records were different in any way?

java8964 commented 9 years ago

I can reproduce the count difference using a single SSTable file, so I would like your help tracking down this issue.

Our data uses composite keys and composite column names. I chose one example SSTable whose -Data.db file is about 1,190,323,388 bytes, plus its -CompressionInfo.db, -Index.db, -Filter.db, -Summary.db, and -Statistics.db files.

Parsing the -Data.db file with sstable2json reports 195238 row keys. If I iterate the columns, I get 46167243 columns across those row keys.

Now, I built the index for the above set of files and then wrote the following unit test code:

    public void testSSTableRowInputFormat() throws Exception {
        long keyCnt = 0;
        long recordCnt = 0;
        Properties props = new Properties();
        props.load(this.getClass().getClassLoader().getResourceAsStream("t.properties"));
        Configuration conf = new Configuration();
        conf.set(HadoopSSTableConstants.HADOOP_SSTABLE_CQL, props.getProperty("cassandra.table.ddl"));
        Job job = new Job(conf);
        SSTableRowInputFormat ssTableRowInputFormat = new SSTableRowInputFormat();
        ssTableRowInputFormat.addInputPath(job, new Path("/folder/"));
        for (InputSplit inputSplit : ssTableRowInputFormat.getSplits(job)) {
            SSTableSplit sstableSplit = (SSTableSplit) inputSplit;
            TaskAttemptContext context = new TaskAttemptContext(conf, TaskAttemptID.forName("attempt_200707121733_0001_m_000000_0"));
            RecordReader<ByteBuffer, SSTableIdentityIterator> recordReader = ssTableRowInputFormat.createRecordReader(inputSplit, context);
            recordReader.initialize(inputSplit, context);
            while (recordReader.nextKeyValue()) {
                keyCnt++;
                SSTableIdentityIterator sii = recordReader.getCurrentValue();
                while (sii.hasNext()) {
                    recordCnt++;
                    sii.next();
                }
            }
        }
        System.out.println("keyCnt = " + keyCnt);
        System.out.println("recordCnt = " + recordCnt);
    }

The output is: keyCnt = 195234, recordCnt = 46167221.

So using hadoop-sstable, it looks like I lost 4 row keys and 22 columns. I have the -Index.db and -Index.db.Index files, but I am not sure how you use the index file internally to generate the splits and parse the -Data.db file. Any help debugging this issue?

Thanks

java8964 commented 9 years ago

I printed split.getStart() + ":" + split.getEnd() in the split loop; here is the output:

0:1073769762 1073771138:2147788988 2147789179:3222478600 3222478734:3485271831

It looks like the file is split into 4 splits. Here are my questions:

1) The -Data.db file itself is only 1,190,323,388 bytes long, so why do the split offsets reach 3,485,271,831?
2) There are gaps between consecutive splits. Is that normal? For example, the first split ends at 1,073,769,762, but the second split starts at 1,073,771,138.

Thanks

Yong

bvanberg commented 9 years ago

Hi Yong,

Good question.

This has to do with compression. -Data.db is a compressed file. -Index.db is an index into the uncompressed data. Splits are generated from the -Index.db file.

Because the splits are indices into the uncompressed data, it follows that the data must be decompressed to leverage the splits. This is where the -CompressionInfo.db file comes in. This file contains information about the compressed blocks in the -Data.db file. This allows us to read the compressed data file as if it were uncompressed. Clear as mud? Fortunately we don't have to worry about these details as the C* i/o code handles the decompression for us and we just read the files as if they were uncompressed.

Given all of that, your splits should map to valid indices found in your -Index.db files. If you suspect that the -Index.db.index files are somehow incorrect, you can validate against the -Index.db directly, but not against the -Data.db file.
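
Roughly, the translation from a split offset (which refers to a position in the uncompressed data) to a location in the compressed -Data.db file works like the sketch below. This is only an illustration of the idea; the chunk layout and names are simplified assumptions, not Cassandra's actual classes:

    // Illustrative only: -CompressionInfo.db is assumed to have been parsed into
    // an array of compressed chunk start positions plus a fixed uncompressed
    // chunk length. Cassandra's own compressed reader does this bookkeeping.
    final class CompressedOffsetSketch {
        /** Returns {compressed chunk start in -Data.db, bytes to skip after decompressing that chunk}. */
        static long[] locate(long uncompressedOffset, long[] chunkOffsets, int chunkLength) {
            int chunkIndex = (int) (uncompressedOffset / chunkLength);
            long compressedChunkStart = chunkOffsets[chunkIndex];
            long positionInChunk = uncompressedOffset % chunkLength;
            return new long[]{compressedChunkStart, positionInChunk};
        }
    }

So a split offset such as 3,485,271,831 can exceed the size of the compressed file on disk because it is measured against the uncompressed stream.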

bvanberg commented 9 years ago

Additionally, the gap is normal. This is because we are generating splits from the Index.db which has a bunch of offsets into the data. If you inspect the Index.db you'll find that the splits account for all of the offsets contained within.
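
As a toy illustration (made-up offsets, not the project's actual split code): roughly speaking, splits are built by walking the row start offsets from Index.db and cutting a new split once the target size is exceeded, so each split ends on the start offset of its last row and the next split begins at the following row's start offset. The bytes of that last row sit between the two boundaries, which is the gap you see.

    import java.util.ArrayList;
    import java.util.List;

    // Toy sketch of grouping Index.db row start offsets into splits.
    public class SplitGapSketch {
        public static void main(String[] args) {
            long[] rowStartOffsets = {0, 40, 90, 150, 200}; // uncompressed positions from Index.db
            long targetSplitSize = 100;

            List<long[]> splits = new ArrayList<>();
            long start = rowStartOffsets[0];
            long last = start;
            for (long offset : rowStartOffsets) {
                if (offset - start > targetSplitSize) {
                    splits.add(new long[]{start, last}); // end = start offset of the split's last row
                    start = offset;
                }
                last = offset;
            }
            splits.add(new long[]{start, last});

            for (long[] s : splits) {
                System.out.println(s[0] + ":" + s[1]); // prints 0:90 and 150:200
            }
            // The row starting at offset 90 occupies bytes 90..149, hence the
            // apparent gap between the first split's end and the second's start.
        }
    }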

java8964 commented 9 years ago

Do you have any hints about why those 4 row keys are not returned by SSTableRowInputFormat, or what additional steps I can take to see why these 4 row keys are missed?

Thanks

bvanberg commented 9 years ago

Given that you are short 4 row keys and you generated 4 splits, there could be an issue there. You should be able to validate that your splits fully cover your Index.db offsets, i.e. that every offset contained within Index.db is accounted for by the split ranges.
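
A quick way to run that check, assuming the row start offsets have already been read out of Index.db and the (start, end) pairs taken from the generated splits (both extraction steps are omitted here), is a sketch like:

    import java.util.List;

    // Illustrative coverage check: every row start offset from Index.db should
    // fall inside some split's [start, end] range (inclusive at both ends).
    // Any offset that matches no split points at a row the reader never sees.
    public class SplitCoverageCheck {
        public static void report(List<Long> indexOffsets, List<long[]> splits) {
            for (long offset : indexOffsets) {
                boolean covered = false;
                for (long[] split : splits) {
                    if (offset >= split[0] && offset <= split[1]) {
                        covered = true;
                        break;
                    }
                }
                if (!covered) {
                    System.out.println("Uncovered Index.db offset: " + offset);
                }
            }
        }
    }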

java8964 commented 9 years ago

Hi,

After some debugging, I think I have identified the bug.

In the file https://github.com/fullcontact/hadoop-sstable/blob/master/sstable-core/src/main/java/com/fullcontact/sstable/hadoop/mapreduce/SSTableRecordReader.java

On line 115, it should be

    protected boolean hasMore() {
        return reader.getFilePointer() <= split.getEnd();
    }

instead of

    protected boolean hasMore() {
        return reader.getFilePointer() < split.getEnd();
    }

The reason is that when the code generates the splits, the gap between split boundaries is fine, but there is one row key whose data starts exactly at the boundary. So in the hasMore() logic, using '<' instead of '<=' loses that one row key at the boundary.

My example data uses the default 1G split size, which produces the splits: 0:1073769762 1073771138:2147788988 2147789179:3222478600 3222478734:3485271831

When the file pointer reaches 1073769762, since hasMore() uses '<', it returns false and we lose the row key located exactly between 1073769762 and 1073771138. I would like to write a unit test for the SSTableRecordReader class if you can give me the CQL for the test data in /data/Keyspace1-Standard1-ic-0-xxx.
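
To make the boundary case concrete, here is a toy sketch (made-up values standing in for the real reader and split objects), showing how the row that starts exactly at the split's end offset is dropped by '<' but kept by '<=':

    public class BoundaryCheckSketch {
        public static void main(String[] args) {
            long splitEnd = 1073769762L;    // end offset of the first split
            long filePointer = 1073769762L; // reader positioned at the start of that split's last row

            boolean hasMoreBuggy = filePointer < splitEnd;   // false -> the row is silently skipped
            boolean hasMoreFixed = filePointer <= splitEnd;  // true  -> the row is read before moving on

            System.out.println("buggy=" + hasMoreBuggy + ", fixed=" + hasMoreFixed); // buggy=false, fixed=true
        }
    }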

I am happy to submit a pull request with a unit test if I can have the CQL of the data SSTable files.

Thanks

bvanberg commented 9 years ago

This is great, thanks. Feel free to PR this at your leisure.


bvanberg commented 9 years ago

I applied this fix in a recent PR. Thanks again!