Too many open files in TestIndexWriterMergePolicy.testStressUpdateSameDocumentWithMergeOnGetReader [LUCENE-10088]

asfimport commented 3 years ago

This build failure reproduces for me. I'll try to dig.

Migrated from LUCENE-10088 by Michael McCandless (@mikemccand), updated Apr 28 2022

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

   >     com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=38, name=Thread-4, state=RUNNABLE, group=TGRP-TestIndexWriterMergePolicy]
   >
   >         Caused by:
   >         java.lang.AssertionError: java.nio.file.FileSystemException: /l/trunk/lucene/core/build/tmp/tests-tmp/lucene.index.TestIndexWriterMergePolicy_971E1D93B44F0009-001/index-NIOFSDirectory-001/_1r0_Lucene90FieldsIndexfile_pointers_3hy.tmp: Too\
 many open files
   >             at __randomizedtesting.SeedInfo.seed([971E1D93B44F0009]:0)
   >             at org.apache.lucene.index.TestIndexWriterMergePolicy.lambda$stressUpdateSameDocumentWithMergeOnX$4(TestIndexWriterMergePolicy.java:752)
   >             at java.base/java.lang.Thread.run(Thread.java:831)
   >
   >             Caused by:
   >             java.nio.file.FileSystemException: /l/trunk/lucene/core/build/tmp/tests-tmp/lucene.index.TestIndexWriterMergePolicy_971E1D93B44F0009-001/index-NIOFSDirectory-001/_1r0_Lucene90FieldsIndexfile_pointers_3hy.tmp: Too many open files
   >                 at org.apache.lucene.mockfile.HandleLimitFS.onOpen(HandleLimitFS.java:45)
   >                 at org.apache.lucene.mockfile.HandleTrackingFS.callOpenHook(HandleTrackingFS.java:82)
   >                 at org.apache.lucene.mockfile.HandleTrackingFS.newOutputStream(HandleTrackingFS.java:163)
   >                 at org.apache.lucene.mockfile.FilterFileSystemProvider.newOutputStream(FilterFileSystemProvider.java:194)
   >                 at java.base/java.nio.file.Files.newOutputStream(Files.java:224)
   >                 at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:392)
   >                 at org.apache.lucene.store.FSDirectory.createTempOutput(FSDirectory.java:234)
   >                 at org.apache.lucene.store.FileSwitchDirectory.createTempOutput(FileSwitchDirectory.java:198)
   >                 at org.apache.lucene.store.MockDirectoryWrapper.createTempOutput(MockDirectoryWrapper.java:745)
   >                 at org.apache.lucene.store.FilterDirectory.createTempOutput(FilterDirectory.java:81)
   >                 at org.apache.lucene.store.TrackingDirectoryWrapper.createTempOutput(TrackingDirectoryWrapper.java:49)
   >                 at org.apache.lucene.codecs.lucene90.compressing.FieldsIndexWriter.<init>(FieldsIndexWriter.java:87)
   >                 at org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.<init>(Lucene90CompressingStoredFieldsWriter.java:148)
   >                 at org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsFormat.fieldsWriter(Lucene90CompressingStoredFieldsFormat.java:140)
   >                 at org.apache.lucene.codecs.lucene90.Lucene90StoredFieldsFormat.fieldsWriter(Lucene90StoredFieldsFormat.java:154)
   >                 at org.apache.lucene.index.StoredFieldsConsumer.initStoredFieldsWriter(StoredFieldsConsumer.java:49)
   >                 at org.apache.lucene.index.StoredFieldsConsumer.startDocument(StoredFieldsConsumer.java:56)
   >                 at org.apache.lucene.index.IndexingChain.startStoredFields(IndexingChain.java:556)
   >                 at org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:587)
   >                 at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:239)
   >                 at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432)
   >                 at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1530)
   >                 at org.apache.lucene.index.IndexWriter.softUpdateDocument(IndexWriter.java:1847)
   >                 at org.apache.lucene.index.RandomIndexWriter.updateDocument(RandomIndexWriter.java:298)
   >                 at org.apache.lucene.index.TestIndexWriterMergePolicy.lambda$stressUpdateSameDocumentWithMergeOnX$4(TestIndexWriterMergePolicy.java:746)
   >                 ... 1 more
  2> NOTE: reproduce with: gradlew test --tests TestIndexWriterMergePolicy.testStressUpdateSameDocumentWithMergeOnGetReader -Dtests.seed=971E1D93B44F0009 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.locale=qu-EC -Dtests.timezone\
=Asia/Vladivostok -Dtests.asserts=true -Dtests.file.encoding=UTF-8
  2> NOTE: leaving temporary files on disk at: /l/trunk/lucene/core/build/tmp/tests-tmp/lucene.index.TestIndexWriterMergePolicy_971E1D93B44F0009-001
  2> NOTE: test params are: codec=Lucene90, sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=qu-EC, timezone=Asia/Vladivostok
  2> NOTE: Linux 5.13.12-arch1-1 amd64/N/A 16.0.2 (64-bit)/cpus=128,threads=1,free=113297584,total=268435456
  2> NOTE: All tests run in this JVM: [TestIndexWriterMergePolicy]

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Ugh, adding -Dtests.verbose=true causes the test to pass, of course :) A heisenbug!

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hmm, another test, on 8.x branch, is also failing from too many open files:

   [junit4]   2> WARNING: Uncaught exception in thread: Thread[Thread-29697,5,TGRP-TestIndexWriterThreadsToSegments]
   [junit4]   2> java.lang.RuntimeException: java.nio.file.FileSystemException: /home/jenkins/jenkins-slave/workspace/Lucene/Lucene-Solr-NightlyTests-8.x/checkout/lucene/build/core/test/J0/temp/lucene.index.TestIndexWriterThreadsToSegments_FAA2519FF20B50F7-001/tempDir-001/_1kc_Lucene85FieldsIndexfile_pointers_34n.tmp: Too many open files
   [junit4]   2>        at __randomizedtesting.SeedInfo.seed([FAA2519FF20B50F7]:0)
   [junit4]   2>        at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:220)
   [junit4]   2> Caused by: java.nio.file.FileSystemException: /home/jenkins/jenkins-slave/workspace/Lucene/Lucene-Solr-NightlyTests-8.x/checkout/lucene/build/core/test/J0/temp/lucene.index.TestIndexWriterThreadsToSegments_FAA2519FF20B50F7-001/tempDir-001/_1kc_Lucene85FieldsIndexfile_pointers_34n.tmp: Too many open files
   [junit4]   2>        at org.apache.lucene.mockfile.HandleLimitFS.onOpen(HandleLimitFS.java:48)
   [junit4]   2>        at org.apache.lucene.mockfile.HandleTrackingFS.callOpenHook(HandleTrackingFS.java:81)
   [junit4]   2>        at org.apache.lucene.mockfile.HandleTrackingFS.newOutputStream(HandleTrackingFS.java:160)
   [junit4]   2>        at org.apache.lucene.mockfile.FilterFileSystemProvider.newOutputStream(FilterFileSystemProvider.java:197)
   [junit4]   2>        at java.nio.file.Files.newOutputStream(Files.java:216)
   [junit4]   2>        at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:410)
   [junit4]   2>        at org.apache.lucene.store.FSDirectory.createTempOutput(FSDirectory.java:267)
   [junit4]   2>        at org.apache.lucene.store.MockDirectoryWrapper.createTempOutput(MockDirectoryWrapper.java:707)
   [junit4]   2>        at org.apache.lucene.store.FilterDirectory.createTempOutput(FilterDirectory.java:79)
   [junit4]   2>        at org.apache.lucene.store.TrackingDirectoryWrapper.createTempOutput(TrackingDirectoryWrapper.java:51)
   [junit4]   2>        at org.apache.lucene.codecs.compressing.FieldsIndexWriter.<init>(FieldsIndexWriter.java:80)
   [junit4]   2>        at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.<init>(CompressingStoredFieldsWriter.java:129)
   [junit4]   2>        at org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:130)
   [junit4]   2>        at org.apache.lucene.codecs.lucene87.Lucene87StoredFieldsFormat.fieldsWriter(Lucene87StoredFieldsFormat.java:141)
   [junit4]   2>        at org.apache.lucene.codecs.asserting.AssertingStoredFieldsFormat.fieldsWriter(AssertingStoredFieldsFormat.java:48)
   [junit4]   2>        at org.apache.lucene.index.StoredFieldsConsumer.initStoredFieldsWriter(StoredFieldsConsumer.java:48)
   [junit4]   2>        at org.apache.lucene.index.StoredFieldsConsumer.startDocument(StoredFieldsConsumer.java:55)
   [junit4]   2>        at org.apache.lucene.index.DefaultIndexingChain.startStoredFields(DefaultIndexingChain.java:452)
   [junit4]   2>        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:488)
   [junit4]   2>        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:208)
   [junit4]   2>        at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:415)
   [junit4]   2>        at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1471)
   [junit4]   2>        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1757)
   [junit4]   2>        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1400)
   [junit4]   2>        at org.apache.lucene.index.TestIndexWriterThreadsToSegments$2.run(TestIndexWriterThreadsToSegments.java:209)
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestIndexWriterThreadsToSegments -Dtests.seed=FAA2519FF20B50F7 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene/Lucene-Solr-NightlyTests-8.x/test-data/enwiki.random.lines.txt -Dtests.locale=et-EE -Dtests.timezone=Australia/Victoria -Dtests.asserts=true -Dtests.file.encoding=UTF-8

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hmm, I instrumented MockDirectoryWrapper just a bit to print all close after the too many open files exception. There are indeed many open files, but curiously they are all .fdt and .fdx and temp files. They look like this:

  1>   MDW.close: name=_2d1.fdm thread=Thread[Thread-3,5,TGRP-TestIndexWriterMergePolicy]
  1>   MDW.close: name=_2d1.fdt thread=Thread[Thread-3,5,TGRP-TestIndexWriterMergePolicy]
  1>   MDW.close: name=_2d1_Lucene90FieldsIndex-doc_ids_4q0.tmp thread=Thread[Thread-3,5,TGRP-TestIndexWriterMergePolicy]
  1>   MDW.close: name=_2d1_Lucene90FieldsIndexfile_pointers_4q1.tmp thread=Thread[Thread-3,5,TGRP-TestIndexWriterMergePolicy]
  1>   MDW.close: name=_2d2.fdm thread=Thread[Thread-3,5,TGRP-TestIndexWriterMergePolicy]
  1>   MDW.close: name=_2d2.fdt thread=Thread[Thread-3,5,TGRP-TestIndexWriterMergePolicy]
  1>   MDW.close: name=_2d2_Lucene90FieldsIndex-doc_ids_4q2.tmp thread=Thread[Thread-3,5,TGRP-TestIndexWriterMergePolicy]
  1>   MDW.close: name=_2d2_Lucene90FieldsIndexfile_pointers_4q3.tmp thread=Thread[Thread-3,5,TGRP-TestIndexWriterMergePolicy]
  1>   MDW.close: name=_2d3.fdm thread=Thread[Thread-3,5,TGRP-TestIndexWriterMergePolicy]
  1>   MDW.close: name=_2d3.fdt thread=Thread[Thread-3,5,TGRP-TestIndexWriterMergePolicy]
  1>   MDW.close: name=_2d3_Lucene90FieldsIndex-doc_ids_4q4.tmp thread=Thread[Thread-3,5,TGRP-TestIndexWriterMergePolicy]
  1>   MDW.close: name=_2d3_Lucene90FieldsIndexfile_pointers_4q5.tmp thread=Thread[Thread-3,5,TGRP-TestIndexWriterMergePolicy]
  1>   MDW.close: name=_2d4.fdm thread=Thread[Thread-3,5,TGRP-TestIndexWriterMergePolicy]
  1>   MDW.close: name=_2d4.fdt thread=Thread[Thread-3,5,TGRP-TestIndexWriterMergePolicy]
  1>   MDW.close: name=_2d4_Lucene90FieldsIndex-doc_ids_4q6.tmp thread=Thread[Thread-3,5,TGRP-TestIndexWriterMergePolicy]
  1>   MDW.close: name=_2d4_Lucene90FieldsIndexfile_pointers_4q7.tmp thread=Thread[Thread-3,5,TGRP-TestIndexWriterMergePolicy]

In total there are 667 .fdt and .fdx still open, and 1334 temp files. Maybe something about flushing/merging stored fields is leaking file handles?

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Or, maybe, with this seed, this test is legitimately thinking it can make an index with \~667 segments ...

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I dug a bit and I'm seeing almost 600 blocked flushes (DocumentsWriterFlushControl#blockedFlushes) when the exception is raised. I still don't understand why this number goes so high, but it looks to me like this isn't about leaked files, IndexWriter knows about these 600 segments that are pending to be flushed and the reason why we are seeing files for stored fields is because stored fields are the only data structure that gets written on the fly.

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Ooooh I like that theory @jpountz – so IW is (illegally) allowing too many backlogged flushes to build up. IW normally tries to throttle incoming indexing threads when this happens, so something has gone wrong. Maybe full flush is somehow failing to throttle properly? Thanks for digging @jpountz.

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Here are some more details on what I'm seeing:

The test starts stressing concurrent updates.
At some point RandomIndexWriter sets maxBufferedDocs to 15, forcing frequent flushes.
One thread calls IndexWriter#getReader, a full flush starts, which sets DocumentsWriterFlushControl#fullFlush as true as a side-effect.
While the full flush is still in progress, the indexing threads keep adding new documents.
Given the low maxBufferedDocs, DWPTs are regularly marked for flush. However since DocumentsWriterFlushControl#fullFlush is true, indexing threads don't help flushing these DWPTs, they keep accumulating in DocumentsWriterFlushControl#blockedFlushes.
We hit the limit on the number of open files before DocumentsWriterFlushControl#finishFullFlush had a chance to prune all the blocked flushes.

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

@s1monw I'd be interested in your take on this one.

There's a thread on dev@ regarding whether this issue should block 8.10. This looks to me like a real problem, but I don't think it is a new one and it looks like it is unlikely to hit users in practice since it requires a low maxDocsPerChunks combined with a high indexing rate so I don't think we should hold 8.10 for this issue?

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Yeah now that we understand the (exotic) root cause, I don't think this should block 8.10. I'll reply on the dev thread.

Though, there was a similar failure in 8.x build, in a different test. Let's try to understand that cause too.

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK the 8.x failure repros for me:

ant test  -Dtestcase=TestIndexWriterThreadsToSegments -Dtests.seed=FAA2519FF20B50F7 -Dtests.multiplier=2 -Dtests.nightly=true -Dtests.slow=true -Dtests.linedocsfile=/home/jenkins/jenkins-slave/workspace/Lucene/Lucene-Solr-NightlyTests-8.x/test-data/enwiki.random.lines.txt -Dtests.locale=et-EE -Dtests.timezone=Australia/Victoria -Dtests.asserts=true -Dtests.file.encoding=UTF-8

I'm testing with adding -Dtests.verbose=true.

asfimport commented 3 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Hey @jpountz I think your analysis is right and makes sense to me. Unfortunately this is somewhat expected and an existing limitation of the throttle mechanics in IW. I also agree it's an edge case and very unlikely to happen anywhere outside of a randomized test. Due to the fact that we are doing a full flush we can't finish flushing the buffered segments that pile up caused by randomly set buffered docs. I think the TooManyOpenFilesExc is only one possible issue we could hit here, OOM would be another. I think we can improve the protection IW offers here and either add a non-aborting exception that rejects the document before it gets our of hand with too many buffered segments which would be a awkward to the user or we improve our stalling option which is currently solely based on how big our flush buffer is. I think we can certainly find a good default that would prevent 600+ blocked segments without it having any impact in the real world. If I am not mistaken I think we can simply change this one line here to add some upper-bound to make it work.

asfimport commented 3 years ago

Nhat Nguyen (@dnhatn) (migrated from JIRA)

I also looked into the test failure and had a similar theory. I ran the test with Simon's suggestion (i.e., stall when the number of blocked flushes is over 256), and it passed. I am +1 to introduce this protection. Although, this stall should be rarely triggered in production code as users should not set maxBufferedDocs too low.

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1 to add that one-liner corner-case protection. Thanks @s1monw and @dnhatn!

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1 to this one-line fix to avoid worst-case scenarios

OOM would be another

+1 More generally I dislike that IndexWriter may use much more memory than configured on the IndexWriterConfig. Is there a way that blocked flushes could still be accounted against maxBufferedDocs/ramBufferSizeMB, and that IndexWriter would stall indexing when blocked flushes consume the entire maxBufferedDocs/ramBufferSizeMB budget?

asfimport commented 3 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Oh don’t get me wrong Adrien, we do that. We have that protection in place. If we use 2x the configured ram buffer for flushing we stall. The ram buffer is not an upperbound the iw will use but rather a marker that triggers a flush. Yet, we still use a function of it to make sure we don’t go OOM when flushes are slow.

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I bumped the limit on just this test class to be twice the default. I know it masks the underlying issue but we know what it is, can verify any potential fix by removing the annotation, and the repeated failures in jenkins don't bring in anything new.

This test passes for me with an increased handle count, although it takes a loooong time for this seed:

-ea -Dtests.seed=21D53F262220F3E9 -Dtests.multiplier=2 -Dtests.nightly=true

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit adec73dd284ebf565d7d205cccc519dac1ffe61c in lucene's branch refs/heads/main from Dawid Weiss https://gitbox.apache.org/repos/asf?p=lucene.git;h=adec73d

LUCENE-10088: allow per-class override in HandleLimitFS. Bump the limit a bit for nightlies in TestIndexWriterMergePolicy. (#424)

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Sorry, closed this by accident - it's definitely not fixed.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit 49c8cee06f01e79a1b890d9b542ada79bacf9dd4 in lucene's branch refs/heads/branch_9x from Dawid Weiss https://gitbox.apache.org/repos/asf?p=lucene.git;h=49c8cee06f0

LUCENE-10088: allow per-class override in HandleLimitFS. Bump the limit a bit for nightlies in TestIndexWriterMergePolicy. (#424). Suppress SimpleTextCodec on TestIndexWriterMergePolicy. (#851)

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit cb206196e4c929b6e639c58891950e5d6164f63d in lucene's branch refs/heads/branch_9_1 from Dawid Weiss https://gitbox.apache.org/repos/asf?p=lucene.git;h=cb206196e4c

LUCENE-10088: allow per-class override in HandleLimitFS. Bump the limit a bit for nightlies in TestIndexWriterMergePolicy. (#424). Suppress SimpleTextCodec on TestIndexWriterMergePolicy. (#851)

apache / lucene

Too many open files in TestIndexWriterMergePolicy.testStressUpdateSameDocumentWithMergeOnGetReader [LUCENE-10088] #11126