This crash happens in ChunkDocument.tryParseStream() when buffer[writeCursor++] = (byte)c; attempts to write past the end of buffer. The buffer was sized at 256KB, based on the assertion that gov2 documents are truncated at 256KB.
The document that causes the crash has length 357895. This document was encountered while processing GX229-1000-1500.chunk, which was a version of GX229.chunk that was filtered by BitFunnel to contain documents with unique posting counts from 1000 to 1500.
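One possible way to avoid the out-of-bounds write is to grow the buffer on demand rather than rely on the 256KB truncation claim. The following is only a minimal sketch: GrowableByteBuffer, its write(), length() and capacity() methods are hypothetical names and are not part of ChunkDocument, and it assumes the reported length of 357895 is measured in bytes.

```java
import java.util.Arrays;

/** Hypothetical growable byte buffer; NOT the actual ChunkDocument code. */
final class GrowableByteBuffer {
    private byte[] buffer;
    private int writeCursor = 0;

    GrowableByteBuffer(int initialCapacity) {
        // Guard against a zero-length array, which could never be doubled.
        buffer = new byte[Math.max(1, initialCapacity)];
    }

    /** Appends one byte, doubling the backing array when it is full. */
    void write(int c) {
        if (writeCursor == buffer.length) {
            buffer = Arrays.copyOf(buffer, buffer.length * 2);
        }
        buffer[writeCursor++] = (byte) c;
    }

    int length() {
        return writeCursor;
    }

    int capacity() {
        return buffer.length;
    }

    public static void main(String[] args) {
        // Start at the 256KB size described above, then write a document
        // of the length that triggered the crash (assumed to be bytes).
        GrowableByteBuffer b = new GrowableByteBuffer(256 * 1024);
        for (int i = 0; i < 357895; i++) {
            b.write('a');
        }
        System.out.println(b.length() + " bytes written, capacity " + b.capacity());
    }
}
```

Doubling keeps the amortized cost of each write constant. An alternative would be to keep the fixed 256KB buffer and reject or truncate oversized documents with an explicit error instead of an array index exception.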
Some observations:
BitFunnel was able to read GX229.chunk in order to generate the filtered chunk GX229-1000-1500.chunk. This suggests that GX229.chunk, which was created by mg4j, was well-formed, even if it contained long documents.
In the chunk processing pipeline, mg4j generated GX229.chunk, but never attempted to read it.
My leading theory is that the original gov2 GX229 directory contains a bundle (.txt file) with a document that tikka represents as longer than 256KB.