apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing [LUCENE-8118] #9166

Open asfimport opened 6 years ago

asfimport commented 6 years ago

Indexing a large collection of about 20 million paragraph-sized documents results in an ArrayIndexOutOfBoundsException in org.apache.lucene.index.TermsHashPerField.writeByte (full stack trace below).

The bug is possibly related to the issues described here and in SOLR-10936 – but I am not using Solr; I am using Lucene Core directly.

The issue can be reproduced using the code from the GitHub project trec-car-tools-example; the paragraphCorpus.cbor file it reads is contained in this archive.

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -65536
    at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
    at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:224)
    at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:159)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:281)
    at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:451)
    at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532)
    at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1508)
    at edu.unh.cs.TrecCarBuildLuceneIndex.main(TrecCarBuildLuceneIndex.java:55)


Migrated from LUCENE-8118 by Laura Dietz, 2 votes, updated Aug 19 2022

Environment:

Debian/Stretch
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

Attachments: LUCENE-8118_test.patch

asfimport commented 6 years ago

Diego Ceccarelli (migrated from JIRA)

Looking at your code, it seems that there is only one commit at the end, and your collection is big. What if you try to commit every, let's say, 50k docs?
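
(For reference, a minimal sketch of that pattern, assuming a standard analyzer and an on-disk index; readParagraphs() is a placeholder for the CBOR-reading code in trec-car-tools-example, not the reporter's actual code:)

import java.nio.file.Paths;
import java.util.Collections;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class PeriodicCommitIndexer {
  public static void main(String[] args) throws Exception {
    // Hypothetical sketch of the suggested workaround: one addDocument() per doc,
    // with a commit every 50k docs instead of a single commit at the very end.
    try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("paragraph.lucene")),
        new IndexWriterConfig(new StandardAnalyzer()))) {
      long count = 0;
      for (Document doc : readParagraphs()) {
        writer.addDocument(doc);
        if (++count % 50_000 == 0) {
          writer.commit();   // flush the in-memory buffers periodically
        }
      }
      writer.commit();
    }
  }

  // Placeholder: build one Lucene Document per paragraph of the corpus.
  static Iterable<Document> readParagraphs() {
    return Collections.emptyList();
  }
}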

asfimport commented 6 years ago

Laura Dietz (migrated from JIRA)

Yes, that works - Thanks, Diego!

An exception message along the lines of "Buffer full, call index.commit!" would have helped me.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Issuing unnecessary commits is just masking the issue: you shouldn't see this exception.

asfimport commented 6 years ago

Diego Ceccarelli (migrated from JIRA)

I agree, that was just a workaround for Laura Dietz :)

asfimport commented 6 years ago

Laura Dietz (migrated from JIRA)

Robert, that would be even better!

It is difficult to guess what the right interval for issuing commits is. I understand that some hand tuning may be necessary to get the highest performance under given resource constraints. If the issue is a buffer filling up, it would be helpful to have some form of emergency auto-commit.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

It is nothing like that, it is simply a bug.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Well, I understand the bug, but not sure what the fix is.

The indexing code implements Iterable etc. to pull in the docs, and makes one single call to addDocuments().

This is supposed to be an "atomic add" of multiple documents at once, which gives certain guarantees: it is needed for nested documents and similar features, so that the document IDs will be aligned in a particular way.

In your case, it's too much data; IndexWriter isn't going to be able to do 200M docs in one operation like this.
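
(For readers unfamiliar with the API, a hypothetical sketch of the kind of small block addDocuments() is designed for – not code from this issue; writer is an existing IndexWriter:)

import java.util.Arrays;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;

// Hypothetical sketch: addDocuments() indexes one small block atomically, so the
// children and their parent receive adjacent doc IDs (by convention the parent comes last).
// Block joins depend on that layout; the method is not a bulk-loading API.
Document child1 = new Document();
child1.add(new StringField("type", "child", Field.Store.NO));
Document child2 = new Document();
child2.add(new StringField("type", "child", Field.Store.NO));
Document parent = new Document();
parent.add(new StringField("type", "parent", Field.Store.NO));

writer.addDocuments(Arrays.asList(child1, child2, parent));  // one atomic block of three docs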

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

whatever we decide to do, we can be sure that AIOOBE is not the right answer :)

asfimport commented 6 years ago

Laura Dietz (migrated from JIRA)

I think my mistake was to abuse addDocuments(iterator).

I switched to addDocument(doc) with a commit every so often (see the master branch).

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

yeah, but we still need to fix the case where someone passes too many documents for addDocuments to succeed: it needs to be better than AIOOBE.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here's a really bad test, but it works (takes about 2 minutes).

lucene/core$ ant test -Dtestcase=TestIndexWriter -Dtestmethod=testAddDocumentsMassive -Dtests.heapsize=4G

  [junit4] <JUnit4> says مرحبا! Master seed: 1655BF16A8843A6A
   [junit4] Executing 1 suite with 1 JVM.
   [junit4] 
   [junit4] Started J0 PID(22813@localhost).
   [junit4] Suite: org.apache.lucene.index.TestIndexWriter
   [junit4] HEARTBEAT J0 PID(22813@localhost): 2018-01-05T11:27:48, stalled for 71.2s at: TestIndexWriter.testAddDocumentsMassive
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriter -Dtests.method=testAddDocumentsMassive -Dtests.seed=1655BF16A8843A6A -Dtests.locale=fr-FR -Dtests.timezone=Asia/Oral -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR    121s | TestIndexWriter.testAddDocumentsMassive <<<
   [junit4]    > Throwable #1: java.lang.ArrayIndexOutOfBoundsException: -65536
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([1655BF16A8843A6A:2B0B86082D338FEA]:0)
   [junit4]    >    at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
   [junit4]    >    at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:224)
   [junit4]    >    at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:80)
   [junit4]    >    at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:171)
   [junit4]    >    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
   [junit4]    >    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
   [junit4]    >    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
   [junit4]    >    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
   [junit4]    >    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:281)
   [junit4]    >    at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:452)
   [junit4]    >    at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1530)
   [junit4]    >    at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1506)
   [junit4]    >    at org.apache.lucene.index.TestIndexWriter.testAddDocumentsMassive(TestIndexWriter.java:2994)
   [junit4]    >    at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: leaving temporary files on disk at: /home/rmuir/workspace/lucene-solr/lucene/build/core/test/J0/temp/lucene.index.TestIndexWriter_1655BF16A8843A6A-001
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70), sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@2a213e8b), locale=fr-FR, timezone=Asia/Oral
   [junit4]   2> NOTE: Linux 4.4.0-104-generic amd64/Oracle Corporation 1.8.0_45 (64-bit)/cpus=8,threads=1,free=543468648,total=2733637632
   [junit4]   2> NOTE: All tests run in this JVM: [TestIndexWriter]
   [junit4] Completed [1/1 (1!)] in 121.55s, 1 test, 1 error <<< FAILURES!
   [junit4] 
   [junit4] 
   [junit4] Tests with failures [seed: 1655BF16A8843A6A]:
   [junit4]   - org.apache.lucene.index.TestIndexWriter.testAddDocumentsMassive
   [junit4] 
   [junit4] 
   [junit4] JVM J0:     0.38 ..   122.56 =   122.18s
   [junit4] Execution time total: 2 minutes 2 seconds
   [junit4] Tests summary: 1 suite, 1 test, 1 error

BUILD FAILED
/home/rmuir/workspace/lucene-solr/lucene/common-build.xml:1512: The following error occurred while executing this line:
/home/rmuir/workspace/lucene-solr/lucene/common-build.xml:1038: There were test failures: 1 suite, 1 test, 1 error [seed: 1655BF16A8843A6A]

Total time: 2 minutes 5 seconds
asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The test had to work hard to hit AIOOBE instead of OOM.

I think most users who do something like this will hit OOM, which is just as confusing and bad. It may technically be a different problem, but given the names of the methods and the APIs, I think it's easy for someone to hit it too. Seems like add/updateDocuments need some sanity checks...

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Note that committing only once at the end is entirely normal and often exactly the right choice.

It's hard to know how to fix this – we could add a best-effort check: if the RAM usage of that one in-memory segment (DWPT) exceeds the hard limit (IWC.setRAMPerThreadHardLimitMB), we throw a better exception?
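
(For context, the hard limit mentioned above is configured on IndexWriterConfig; a minimal sketch with purely illustrative values:)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

// Illustrative values only. The per-thread hard limit caps the RAM one in-memory
// segment (DWPT) may use before it is forcibly flushed; the idea above is to throw
// a clearer exception when a single atomic block cannot fit under that cap.
IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
iwc.setRAMBufferSizeMB(256.0);          // normal flush trigger
iwc.setRAMPerThreadHardLimitMB(1945);   // per-DWPT hard limit (must be below 2048)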

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Well, I think a simple limit can work. For this API, e.g. a simple counter that throws an exception if the iterator has over 100k docs.
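
(Something along these lines – a hypothetical sketch of the proposed check, not a patch attached to this issue; the constant and the exception message are made up:)

import java.util.Iterator;

// Hypothetical sketch: count documents as the block iterator is consumed and fail
// fast with a clear message instead of letting internal buffers overflow into AIOOBE.
static final int MAX_DOCS_PER_BLOCK = 100_000;

static <T> Iterable<T> limitBlockSize(Iterable<T> docs) {
  return () -> new Iterator<T>() {
    private final Iterator<T> delegate = docs.iterator();
    private int seen = 0;

    @Override public boolean hasNext() { return delegate.hasNext(); }

    @Override public T next() {
      if (++seen > MAX_DOCS_PER_BLOCK) {
        throw new IllegalArgumentException(
            "addDocuments() block exceeds " + MAX_DOCS_PER_BLOCK
                + " documents; index smaller blocks or use addDocument()");
      }
      return delegate.next();
    }
  };
}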

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Well, I think a simple limit can work. For this API, e.g. a simple counter that throws an exception if the iterator has over 100k docs.

+1

asfimport commented 6 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

OOMs are complicated in general because once you hit one, there's a very real risk that you won't be able to recover anyway (even constructing a new exception message typically requires memory allocation and this just goes on and on in a vicious cycle). I remember thinking about it a lot in the early days of randomizedrunner, but without any constructive conclusions. I tried preallocating stuff in advance (not possible in all cases) and workarounds like keeping a memory buffer that is made reclaimable on OOM (so that there's some memory available before we hit the next one)... these are hacks more than solutions and they don't always work anyway (as in when you have background heap-competing threads...).

I like Java, but it starts to show its wrinkles. :(

asfimport commented 6 years ago

Laura Dietz (migrated from JIRA)

Dawid, Michael, my computer has plenty of RAM, which is why I never see an OOM exception and always get the AIOOBE.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Dawid it is not complicated in this case. It is trivial to fix.

Again to explain:

There are a couple of problems here. The first is the method's name: addDocuments is not the plural form of addDocument, it's something totally different altogether. It needs to be addDocumentsAtomic or addDocumentsBlock or something else, anything else. It's also missing bounds checks, which is why you see the AIOOBE; those need to be added.

asfimport commented 6 years ago

Laura Dietz (migrated from JIRA)

+1

asfimport commented 3 years ago

Nam-Quang Tran (migrated from JIRA)

Here's another stack trace, slightly different from the one in the original post. My crash happens not with addDocuments but with addDocument. The maximum heap size is 16 GB; the Lucene version is 8.5.2.

Note: I got this stacktrace from a user, and have no access to the data being indexed, so my testing ability is very limited.

org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:681)
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:695)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1591)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1213)
    at com.docfetcherpro.model.TreeFolderWrapper$.addDoc(TreeModelWrapper.scala:189)
    at com.docfetcherpro.model.TreeNodeWrapper.addDoc(TreeModelWrapper.scala:533)
    at com.docfetcherpro.model.TreeNodeWrapper.update(TreeModelWrapper.scala:315)
    at com.docfetcherpro.model.TreeUpdate$.updateNodePair(TreeUpdate.scala:333)
    at com.docfetcherpro.model.TreeUpdate$.$anonfun$update$6(TreeUpdate.scala:137)
    at com.docfetcherpro.model.TreeUpdate$.$anonfun$update$6$adapted(TreeUpdate.scala:133)
    at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:985)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at scala.collection.generic.TraversableForwarder.foreach(TraversableForwarder.scala:38)
    at scala.collection.generic.TraversableForwarder.foreach$(TraversableForwarder.scala:38)
    at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:47)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984)
    at com.docfetcherpro.model.TreeUpdate$.update(TreeUpdate.scala:133)
    at com.docfetcherpro.model.IndexActor.index1(IndexActor.scala:127)
    at com.docfetcherpro.model.IndexActor.$anonfun$index$1(IndexActor.scala:18)
    at com.docfetcherpro.util.MethodActor$$anon$3.run(MethodActor.scala:86)
    at com.docfetcherpro.util.MethodActor.com$docfetcherpro$util$MethodActor$$threadLoop(MethodActor.scala:185)
    at com.docfetcherpro.util.MethodActor$$anon$2.run(MethodActor.scala:67)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 71428
    at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
    at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:221)
    at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:80)
    at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:121)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:178)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:862)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:442)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:406)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:250)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:495)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1213)
    at com.docfetcherpro.model.TreeFolderWrapper$.addDoc(TreeModelWrapper.scala:189)
    at com.docfetcherpro.model.TreeNodeWrapper.addDoc(TreeModelWrapper.scala:533)
    at com.docfetcherpro.model.TreeNodeWrapper.update(TreeModelWrapper.scala:310)
    ... 15 more

asfimport commented 3 years ago

Nam-Quang Tran (migrated from JIRA)

Update on my previous post: After some back and forth with the reporting user, it turned out that the crash was caused by some specific files named "delete.pdf". Unfortunately, the user deleted these files before I could get my hands on them. But sharing them may have been problematic anyway, as they were some old bank statements. In any case, it seems that feeding the contents of some bad PDF files to Lucene can cause AIOOBEs.

asfimport commented 3 years ago

Nam-Quang Tran (migrated from JIRA)

Another user has now reported to me the crash I posted above. This time I was able to pinpoint the files that were causing the crash: they were just large binary files (> 2 GB). So I think the crash can now be reproduced relatively easily: set a high maximum heap size (16 GB in my case), then shove a giant binary file into Lucene.
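
(One application-side defense against inputs like that – my own suggestion, not something proposed in this thread – is to cap how many tokens a single field may contribute, e.g. with LimitTokenCountAnalyzer:)

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.miscellaneous.LimitTokenCountAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

// Wrap the analyzer so each field stops producing tokens after a cap (1 million here,
// purely illustrative). Tokens beyond the cap are dropped, which bounds how much the
// indexing chain can buffer for one pathological document.
Analyzer capped = new LimitTokenCountAnalyzer(new StandardAnalyzer(), 1_000_000);
IndexWriterConfig iwc = new IndexWriterConfig(capped);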

asfimport commented 2 years ago

Luís Filipe Nassif (migrated from JIRA)

Hi, a colleague of mine pointed this to me. Should I close https://issues.apache.org/jira/browse/LUCENE-10681 as duplicate?

We hit this AIOOBE on the 640th iteration of addDocumentS(Iterable) with ~10MB-sized docs. Is there a known upper bound on numDocs x docSize passed to addDocumentS()?

PS: possibly there were other documents being indexed in parallel by other threads

PS2: our default commit time interval is 30min

PS3: I changed our application from addDocument() to addDocumentS() in part because of the nice atomic guarantees and because we need all text chunks to be children of one parent document. If we have to call addDocumentS() multiple times with smaller iterables, we will possibly have to implement the parent-children control ourselves (as we did in the past with the first method)... or not?
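
(One way to keep the parent-children relation without a single huge atomic block – a sketch of the application-side bookkeeping hinted at above, with made-up field names, not advice from the Lucene developers:)

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

// Hypothetical sketch: instead of relying on the doc-ID adjacency that one big
// addDocuments() block provides, tag every chunk with its parent's application-level id
// so the relation survives indexing the chunks in separate, smaller calls.
// Note: this gives up Lucene's block-join support; grouping must then happen at query time.
Document chunk = new Document();
chunk.add(new StringField("parentId", parentId, Field.Store.YES));  // parentId: your own id
chunk.add(new TextField("content", chunkText, Field.Store.NO));     // chunkText: one text chunk
writer.addDocument(chunk);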

asfimport commented 2 years ago

Luís Filipe Nassif (migrated from JIRA)

I did some workarounds in our project to avoid this. But as suggested before, it would be nice to throw a better exception and to update the IndexWriter.addDocumentS(Iterable) documentation to warn users that they shouldn't pass huge collections or huge documents to that method at once.