Open asfimport opened 6 years ago
Diego Ceccarelli (migrated from JIRA)
Looking at your code it seems that there is only one commit at the end, and your collection is big. What if you try to commit every, let's say, 50k docs?
Laura Dietz (migrated from JIRA)
Yes, that works - Thanks, Diego!
I think an exception message indicating "Buffer full, call index.commit!" would have helped me.
Robert Muir (@rmuir) (migrated from JIRA)
Issuing unnecessary commits is just masking the issue: you shouldn't see this exception.
Diego Ceccarelli (migrated from JIRA)
I agree, that was just a workaround for Laura Dietz :)
Laura Dietz (migrated from JIRA)
Robert, that would be even better!
It is difficult to guess the right interval for issuing commits. I understand that some hand tuning might be necessary to get the highest performance under given resource constraints. If the issue is a buffer filling up, it would be helpful to have some form of emergency auto-commit.
Robert Muir (@rmuir) (migrated from JIRA)
It is nothing like that, it is simply a bug.
Robert Muir (@rmuir) (migrated from JIRA)
Well, I understand the bug, but not sure what the fix is.
The indexing code implements Iterable etc. to pull in the docs, and makes one single call to addDocuments().
This is supposed to be an "atomic add" of multiple documents at once, which gives certain guarantees: it's needed for nested documents and features like that, so that document IDs will be aligned in a particular way.
In your case, it's too much data; IndexWriter isn't going to be able to do 200M docs in one operation like this.
Robert Muir (@rmuir) (migrated from JIRA)
whatever we decide to do, we can be sure that AIOOBE is not the right answer :)
Laura Dietz (migrated from JIRA)
I think my mistake was to abuse addDocuments(iterator).
I switched to addDocument(doc) with a commit every so often (see master branch)
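Laura's pattern above (per-document adds with a periodic commit) can be sketched roughly as follows. This is a generic illustration, not Lucene code: `addDocument` and `commit` here are stand-ins for `IndexWriter.addDocument(doc)` and `IndexWriter.commit()`, and the batch size of 50k is Diego's suggested value, not a tuned constant.

```java
import java.util.function.Consumer;

// Hypothetical sketch: add documents one at a time and commit every
// BATCH docs, so no single in-memory block grows without bound.
class BatchedIndexer {
    static final int BATCH = 50_000; // Diego's suggested interval; tune as needed

    // Returns the number of commits issued, for illustration.
    static <D> int indexAll(Iterable<D> docs, Consumer<D> addDocument, Runnable commit) {
        int sinceCommit = 0;
        int commits = 0;
        for (D doc : docs) {
            addDocument.accept(doc);   // stand-in for IndexWriter.addDocument(doc)
            if (++sinceCommit == BATCH) {
                commit.run();          // stand-in for IndexWriter.commit()
                commits++;
                sinceCommit = 0;
            }
        }
        if (sinceCommit > 0) {         // final commit for the tail
            commit.run();
            commits++;
        }
        return commits;
    }
}
```

As Robert notes below this exchange, periodic commits are a workaround rather than a fix; commit frequency should normally be driven by durability needs, not by buffer limits.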
Robert Muir (@rmuir) (migrated from JIRA)
yeah, but we still need to fix the case where someone passes too many documents for addDocuments to succeed: it needs to be better than AIOOBE.
Robert Muir (@rmuir) (migrated from JIRA)
Here's a really bad test, but it works (takes about 2 minutes).
lucene/core$ ant test -Dtestcase=TestIndexWriter -Dtestmethod=testAddDocumentsMassive -Dtests.heapsize=4G
[junit4] <JUnit4> says مرحبا! Master seed: 1655BF16A8843A6A
[junit4] Executing 1 suite with 1 JVM.
[junit4]
[junit4] Started J0 PID(22813@localhost).
[junit4] Suite: org.apache.lucene.index.TestIndexWriter
[junit4] HEARTBEAT J0 PID(22813@localhost): 2018-01-05T11:27:48, stalled for 71.2s at: TestIndexWriter.testAddDocumentsMassive
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestIndexWriter -Dtests.method=testAddDocumentsMassive -Dtests.seed=1655BF16A8843A6A -Dtests.locale=fr-FR -Dtests.timezone=Asia/Oral -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[junit4] ERROR 121s | TestIndexWriter.testAddDocumentsMassive <<<
[junit4] > Throwable #1: java.lang.ArrayIndexOutOfBoundsException: -65536
[junit4] > at __randomizedtesting.SeedInfo.seed([1655BF16A8843A6A:2B0B86082D338FEA]:0)
[junit4] > at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
[junit4] > at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:224)
[junit4] > at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:80)
[junit4] > at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:171)
[junit4] > at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
[junit4] > at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
[junit4] > at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
[junit4] > at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
[junit4] > at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:281)
[junit4] > at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:452)
[junit4] > at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1530)
[junit4] > at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1506)
[junit4] > at org.apache.lucene.index.TestIndexWriter.testAddDocumentsMassive(TestIndexWriter.java:2994)
[junit4] > at java.lang.Thread.run(Thread.java:745)
[junit4] 2> NOTE: leaving temporary files on disk at: /home/rmuir/workspace/lucene-solr/lucene/build/core/test/J0/temp/lucene.index.TestIndexWriter_1655BF16A8843A6A-001
[junit4] 2> NOTE: test params are: codec=Asserting(Lucene70), sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@2a213e8b), locale=fr-FR, timezone=Asia/Oral
[junit4] 2> NOTE: Linux 4.4.0-104-generic amd64/Oracle Corporation 1.8.0_45 (64-bit)/cpus=8,threads=1,free=543468648,total=2733637632
[junit4] 2> NOTE: All tests run in this JVM: [TestIndexWriter]
[junit4] Completed [1/1 (1!)] in 121.55s, 1 test, 1 error <<< FAILURES!
[junit4]
[junit4]
[junit4] Tests with failures [seed: 1655BF16A8843A6A]:
[junit4] - org.apache.lucene.index.TestIndexWriter.testAddDocumentsMassive
[junit4]
[junit4]
[junit4] JVM J0: 0.38 .. 122.56 = 122.18s
[junit4] Execution time total: 2 minutes 2 seconds
[junit4] Tests summary: 1 suite, 1 test, 1 error
BUILD FAILED
/home/rmuir/workspace/lucene-solr/lucene/common-build.xml:1512: The following error occurred while executing this line:
/home/rmuir/workspace/lucene-solr/lucene/common-build.xml:1038: There were test failures: 1 suite, 1 test, 1 error [seed: 1655BF16A8843A6A]
Total time: 2 minutes 5 seconds
Robert Muir (@rmuir) (migrated from JIRA)
the test had to work hard to hit AIOOBE instead of OOM.
I think most users that do something like this will hit OOM, which is just as confusing and bad. It may technically be a different problem, but due to the names of the methods and the APIs, I think it's easy for someone to hit it too. Seems like add/updateDocuments need some sanity checks...
Michael McCandless (@mikemccand) (migrated from JIRA)
Note that committing only once at the end is entirely normal and often exactly the right choice.
It's hard to know how to fix this – we could add a best-effort check that if the RAM usage of that one in-memory segment (DWPT) exceeds the hard limit (IWC.setRAMPerThreadHardLimitMB) we throw a better exception?
Robert Muir (@rmuir) (migrated from JIRA)
Well, I think a simple limit can work. For this API: e.g. a simple counter, throwing an exception if the iterator has over 100k docs.
Michael McCandless (@mikemccand) (migrated from JIRA)
Well, I think a simple limit can work. For this API: e.g. a simple counter, throwing an exception if the iterator has over 100k docs.
+1
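Robert's proposed guard could look roughly like the sketch below. This is not Lucene code: `BlockSizeGuard`, `MAX_DOCS_PER_BLOCK`, and `checkBlockSize` are hypothetical names, and the 100k limit is the value he suggests above, not an actual Lucene constant. The point is simply to fail fast with a descriptive exception before the internal buffers overflow into an AIOOBE.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a sanity check for an atomic addDocuments() block:
// count the documents while draining the iterable, and throw a clear
// exception once the block exceeds the limit.
class BlockSizeGuard {
    static final int MAX_DOCS_PER_BLOCK = 100_000; // illustrative limit from the discussion

    static <D> List<D> checkBlockSize(Iterable<D> docs) {
        List<D> block = new ArrayList<>();
        for (D doc : docs) {
            block.add(doc);
            if (block.size() > MAX_DOCS_PER_BLOCK) {
                throw new IllegalArgumentException(
                    "atomic block exceeds " + MAX_DOCS_PER_BLOCK
                    + " documents; index in smaller blocks and commit between them");
            }
        }
        return block;
    }
}
```

A descriptive IllegalArgumentException like this tells the caller exactly what to change, unlike the AIOOBE from deep inside TermsHashPerField.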
Dawid Weiss (@dweiss) (migrated from JIRA)
OOMs are complicated in general because once you hit one, there's a very real risk that you won't be able to recover anyway (even constructing a new exception message typically requires memory allocation and this just goes on and on in a vicious cycle). I remember thinking about it a lot in the early days of randomizedrunner, but without any constructive conclusions. I tried preallocating stuff in advance (not possible in all cases) and workarounds like keeping a memory buffer that is made reclaimable on OOM (so that there's some memory available before we hit the next one)... these are hacks more than solutions and they don't always work anyway (as in when you have background heap-competing threads...).
I like Java, but it starts to show its wrinkles. :(
Laura Dietz (migrated from JIRA)
Dawid, Michael, my computer has plenty of RAM, which is why I never see an OOM exception and always get the AIOOBE.
Robert Muir (@rmuir) (migrated from JIRA)
Dawid it is not complicated in this case. It is trivial to fix.
Again to explain:
There are a couple of problems here. First is the method's name: addDocuments is not the plural form of addDocument; it's something totally different altogether. It needs to be addDocumentsAtomic or addDocumentsBlock or something else, anything else. It's also missing bounds checks, which is why you see the AIOOBE; those need to be added.
Laura Dietz (migrated from JIRA)
+1
Nam-Quang Tran (migrated from JIRA)
Here's another stacktrace, but slightly different from the one in the original post. My crash happens not with addDocuments, but with addDocument. Maximum heap size is 16 GB. Lucene version is 8.5.2.
Note: I got this stacktrace from a user, and have no access to the data being indexed, so my testing ability is very limited.
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
	at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:681)
	at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:695)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1591)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1213)
	at com.docfetcherpro.model.TreeFolderWrapper$.addDoc(TreeModelWrapper.scala:189)
	at com.docfetcherpro.model.TreeNodeWrapper.addDoc(TreeModelWrapper.scala:533)
	at com.docfetcherpro.model.TreeNodeWrapper.update(TreeModelWrapper.scala:315)
	at com.docfetcherpro.model.TreeUpdate$.updateNodePair(TreeUpdate.scala:333)
	at com.docfetcherpro.model.TreeUpdate$.$anonfun$update$6(TreeUpdate.scala:137)
	at com.docfetcherpro.model.TreeUpdate$.$anonfun$update$6$adapted(TreeUpdate.scala:133)
	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:985)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at scala.collection.generic.TraversableForwarder.foreach(TraversableForwarder.scala:38)
	at scala.collection.generic.TraversableForwarder.foreach$(TraversableForwarder.scala:38)
	at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:47)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984)
	at com.docfetcherpro.model.TreeUpdate$.update(TreeUpdate.scala:133)
	at com.docfetcherpro.model.IndexActor.index1(IndexActor.scala:127)
	at com.docfetcherpro.model.IndexActor.$anonfun$index$1(IndexActor.scala:18)
	at com.docfetcherpro.util.MethodActor$$anon$3.run(MethodActor.scala:86)
	at com.docfetcherpro.util.MethodActor.com$docfetcherpro$util$MethodActor$$threadLoop(MethodActor.scala:185)
	at com.docfetcherpro.util.MethodActor$$anon$2.run(MethodActor.scala:67)
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 71428
	at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
	at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:221)
	at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:80)
	at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:121)
	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:178)
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:862)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:442)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:406)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:250)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:495)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1213)
	at com.docfetcherpro.model.TreeFolderWrapper$.addDoc(TreeModelWrapper.scala:189)
	at com.docfetcherpro.model.TreeNodeWrapper.addDoc(TreeModelWrapper.scala:533)
	at com.docfetcherpro.model.TreeNodeWrapper.update(TreeModelWrapper.scala:310)
	... 15 more
Nam-Quang Tran (migrated from JIRA)
Update on my previous post: After some back and forth with the reporting user, it turned out that the crash was caused by some specific files named "delete.pdf". Unfortunately, the user deleted these files before I could get my hands on them. But sharing them may have been problematic anyway, as they were some old bank statements. In any case, it seems that feeding the contents of some bad PDF files to Lucene can cause AIOOBEs.
Nam-Quang Tran (migrated from JIRA)
Another user has now reported the crash I posted above to me. This time I was able to pinpoint the files that were causing the crash: they were just large binary files (> 2 GB). So I think the crash can now be reproduced relatively easily: set a high maximum heap size (16 GB in my case), then shove a giant binary file into Lucene.
Luís Filipe Nassif (migrated from JIRA)
Hi, a colleague of mine pointed this to me. Should I close https://issues.apache.org/jira/browse/LUCENE-10681 as duplicate?
We hit this AIOOBE in the 640th iteration of addDocumentS(Iterable) with ~10MB sized docs. Is there a known upper bound for numDocs x docSize given to addDocumentS()?
PS: possibly there were other documents being indexed in parallel by other threads
PS2: our default commit time interval is 30min
PS3: I changed our application from addDocument() to addDocumentS() in part because of the nice atomic guarantees, and because all text chunks have to be children of one parent document. If we have to call addDocumentS() multiple times with smaller iterables, possibly we will have to implement the parent-children control ourselves (as we did in the past with the first method)... or not?
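The chunking workaround being discussed could look roughly like the sketch below. Note the trade-off: splitting one huge block into several addDocuments() calls forfeits the atomicity and doc-ID-adjacency guarantee across chunks, so parent-child bookkeeping would move back into the application. `BlockChunker` and `chunks` are hypothetical names for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split one large document block into chunks of at
// most maxPerChunk documents, each of which could then be passed to a
// separate addDocuments() call. Atomicity holds only WITHIN each chunk.
class BlockChunker {
    static <D> List<List<D>> chunks(Iterable<D> docs, int maxPerChunk) {
        List<List<D>> out = new ArrayList<>();
        List<D> current = new ArrayList<>();
        for (D doc : docs) {
            current.add(doc);
            if (current.size() == maxPerChunk) {
                out.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) out.add(current); // tail chunk
        return out;
    }
}
```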
Luís Filipe Nassif (migrated from JIRA)
I did some workarounds in our project to avoid this. But as suggested before, it would be nice to throw a better exception and update the IndexWriter.addDocumentS(Iterable) documentation to warn users that they shouldn't pass huge collections or documents to that method at once.
Indexing a large collection of about 20 million paragraph-sized documents results in an ArrayIndexOutOfBoundsException in org.apache.lucene.index.TermsHashPerField.writeByte (full stack trace below).
The bug is possibly related to issues described here and in SOLR-10936 – but I am not using Solr, I am using Lucene Core directly.
The issue can be reproduced using code from GitHub trec-car-tools-example
mvn compile assembly:single
java -cp ./target/treccar-tools-example-0.1-jar-with-dependencies.jar edu.unh.cs.TrecCarBuildLuceneIndex paragraphs paragraphCorpus.cbor indexDir
Where paragraphCorpus.cbor is contained in this archive
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -65536
	at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
	at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:224)
	at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:159)
	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:281)
	at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:451)
	at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532)
	at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1508)
	at edu.unh.cs.TrecCarBuildLuceneIndex.main(TrecCarBuildLuceneIndex.java:55)
Migrated from LUCENE-8118 by Laura Dietz, 2 votes, updated Aug 19 2022
Attachments: LUCENE-8118_test.patch