apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

DataOutput.writeGroupVInts throws IntegerOverflow exception during merging #13373

Closed iamsanjay closed 1 month ago

iamsanjay commented 1 month ago

Description

As discussed on the mailing list, DataOutput.writeGroupVInts throws an integer overflow exception (java.lang.ArithmeticException) during merging. The goal is to find the root cause and also to improve the exception message.

Exception in thread "Lucene Merge Thread #202" org.apache.lucene.index.MergePolicy$MergeException: java.lang.ArithmeticException: integer overflow
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:735)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:727)
Caused by: java.lang.ArithmeticException: integer overflow
    at java.base/java.lang.Math.toIntExact(Math.java:1135)
    at org.apache.lucene.store.DataOutput.writeGroupVInts(DataOutput.java:354)
    at org.apache.lucene.codecs.lucene99.Lucene99PostingsWriter.finishTerm(Lucene99PostingsWriter.java:379)
    at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:173)
    at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.write(Lucene90BlockTreeTermsWriter.java:1097)
    at org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write(Lucene90BlockTreeTermsWriter.java:398)
    at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:95)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:205)
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:209)
    at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:298)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:137)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5252)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4740)
    at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6541)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:639)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:700)

More context from the reporter

Looking deeper into this, I think we overflowed a term frequency field. Looking at some statistics, in a previous release we had 1,288,526,281 instances of a certain field; the number would be larger now. Each of these would have had a limited set of values. But crucially, nearly all of them would have had the term "positional" or "non-positional" added to the document.

There is no good reason to do this today; we should just turn this into a boolean field and update the UI. I will do this and report back.

Do you think a patch adding a try/catch for a more informative log message would be appreciated by the community? For example, one that mentions the field name in the exception?
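To make the suggestion concrete, here is a minimal standalone sketch of what such a try/catch could look like. It is not a Lucene patch: the class FieldContextDemo and the methods narrow and narrowForField are hypothetical names invented for illustration, and narrow merely stands in for any narrowing write that can overflow.

```java
public class FieldContextDemo {

    /** Hypothetical stand-in for a write that narrows a long to an int. */
    public static int narrow(long value) {
        // Throws ArithmeticException("integer overflow") if value is outside the int range.
        return Math.toIntExact(value);
    }

    /** Same operation, but failures are rethrown with the field name and value attached. */
    public static int narrowForField(String field, long value) {
        try {
            return narrow(value);
        } catch (ArithmeticException e) {
            ArithmeticException wrapped = new ArithmeticException(
                e.getMessage() + " while writing field \"" + field + "\" (value=" + value + ")");
            wrapped.initCause(e); // keep the original stack trace reachable
            throw wrapped;
        }
    }

    public static void main(String[] args) {
        System.out.println(narrowForField("type", 42L));
        try {
            narrowForField("type", 1L << 31);
        } catch (ArithmeticException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the same exception type means existing handlers are unaffected, while the message now names the field that triggered the overflow.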

The index that failed when merging into one segment definitely contained the word "positional" more than 1 billion times. I hope to be able to give a closer number once re-indexing has finished with a "work-around".

Of course the "work-around" is to just fix this correctly by not having that word so often in the index and definitely not as docs, freqs and postings.

For background information.

The use case was to find a set of documents that were either "positional" or "non-positional". This was present in the first check-in of our code 18 years ago! Since then our data has grown a bit ;) The code was using Lucene 1.4.3 at that time. Users would search using this as what would now be a facet, type:positional. I changed this to a field with only IndexOptions.DOCS, which is called 'positional' and is searched as positional:yes, rewriting the previous query syntax behind the scenes so as not to break any user tools.

Version and environment details

No response

iamsanjay commented 1 month ago

The code snippet below is from the 9_10 branch, where this issue was observed. With the latest changes for 10, a few of these lines have been moved out of this method into a new method in another class. The relevant frames from the stack trace are:

java.base/java.lang.Math.toIntExact(Math.java:1135)
org.apache.lucene.store.DataOutput.writeGroupVInts(DataOutput.java:354)

https://github.com/apache/lucene/blob/f12e4899bf0420693e4f524a515dafcf0f21a3d3/lucene/core/src/java/org/apache/lucene/store/DataOutput.java#L337-L356
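For readers unfamiliar with the top frame of the trace: Math.toIntExact throws ArithmeticException with the message "integer overflow" whenever its long argument does not fit in an int. The following is a minimal standalone sketch of that behaviour, not the Lucene code. The class name ToIntExactDemo is made up, and the left shift is only an illustrative assumption about how a value that fits in an int could grow past the int range; it is not a confirmed root cause.

```java
public class ToIntExactDemo {

    /** Returns true if the value can be narrowed to an int without loss. */
    public static boolean fitsInInt(long value) {
        try {
            Math.toIntExact(value);
            return true;
        } catch (ArithmeticException e) {
            // Same exception type and message as in the reported stack trace.
            return false;
        }
    }

    public static void main(String[] args) {
        long count = 1_288_526_281L; // figure quoted by the reporter
        long shifted = count << 1;   // assumption: the same value with one extra low bit of room
        System.out.println(fitsInInt(count));   // prints "true"
        System.out.println(fitsInInt(shifted)); // prints "false"
    }
}
```

The quoted count is below Integer.MAX_VALUE (2,147,483,647), but doubling it is not, which is why values held in a long can pass unnoticed until they are narrowed.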

easyice commented 1 month ago

Sorry for missing the mailing list discussion. From just reading the code, it seems the docDeltaBuffer should not overflow. I will try to reproduce this issue. Could you show me your source code for indexing, and some sample data? @iamsanjay

JervenBolleman commented 1 month ago

Hi @easyice, I am the original reporter on the mailing list.

As the code around indexing is a bit abstracted, it might be hard to follow. What I do have is the index that failed merging; it is, however, 173 GB xz compressed. I could use Luke or a similar tool to extract more information for the Lucene team.

The fieldtype that we are indexing into is

UNSTORED_POSITIONAL.setOmitNorms(true);
UNSTORED_POSITIONAL.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
UNSTORED_POSITIONAL.setStored(false);
UNSTORED_POSITIONAL.setTokenized(false);
UNSTORED_POSITIONAL.freeze();

Then we add fields like so

doc.add(new Field("type", value.toLowerCase(Locale.US), UNSTORED_POSITIONAL));

There are over 1,177,800,000 documents in this index, all containing the term "positional" at least once. On average there are three fields of this type in each document.

So to create local sample data I would just do ;)

for (int i = 0; i < 2_000_000_000; i++) {
    Document doc = new Document();
    // Every document gets the same term, so its postings list spans the whole index.
    doc.add(new Field("type", "number", UNSTORED_POSITIONAL));
    if (i % 2 == 0) {
        doc.add(new Field("type", "even", UNSTORED_POSITIONAL));
    } else {
        doc.add(new Field("type", "un-even", UNSTORED_POSITIONAL));
    }
    writer.addDocument(doc);
}

easyice commented 1 month ago

Thank you @JervenBolleman, I have found the cause of the issue together with @gf2121. I will raise a PR later.

mikemccand commented 1 month ago

Here is the java-user discussion that led to this issue.

Thank you for reporting this @iamsanjay! It looks like it was a real bug, phew, and somewhat serious (not sure).

And thank you @easyice and @gf2121 for the quick repro/fix.