apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.58k stars 1.01k forks source link

Strange ConcurrentMergeScheduler behavior with intra-merge threads #13478

Open benwtrent opened 2 months ago

benwtrent commented 2 months ago

Description

It was noticed that the CMS intra-merge behavior was not fully tested. In an effort to do this, a change to override when the intra-merge scheduler is used has been drafted. https://github.com/apache/lucene/pull/13475

This PR has exposed many existing assertions that may all be weird testing failures. But a couple are more worrisome.

In particular:

./gradlew test --tests TestCodecHoldsOpenFiles.test -Dtests.seed=AA3C2FCDA0773874 -Dtests.locale=ar-PS -Dtests.timezone=Asia/Magadan -Dtests.asserts=true -Dtests.file.encoding=UTF-8

Will periodically fail with an NPE on the sort:

        org.apache.lucene.index.MergePolicy$MergeException: java.lang.NullPointerException: Cannot invoke "org.apache.lucene.search.Sort.getSort()" because the return value of "org.apache.lucene.index.LeafMetaData.getSort()" is null
            at __randomizedtesting.SeedInfo.seed([AA3C2FCDA0773874]:0)
            at app//org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:764)
            at app//org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:756)

            Caused by:
            java.lang.NullPointerException: Cannot invoke "org.apache.lucene.search.Sort.getSort()" because the return value of "org.apache.lucene.index.LeafMetaData.getSort()" is null
                at org.apache.lucene.index.SortingCodecReader.assertCreatedOnlyOnce(SortingCodecReader.java:739)
                at org.apache.lucene.index.SortingCodecReader.getOrCreate(SortingCodecReader.java:721)
                at org.apache.lucene.index.SortingCodecReader.getOrCreateDV(SortingCodecReader.java:759)
                at org.apache.lucene.index.SortingCodecReader$7.getNumeric(SortingCodecReader.java:571)
                at org.apache.lucene.codecs.DocValuesConsumer$1.getNumeric(DocValuesConsumer.java:201)
                at org.apache.lucene.codecs.lucene90.Lucene90DocValuesConsumer$1.getSortedNumeric(Lucene90DocValuesConsumer.java:138)
                at org.apache.lucene.codecs.lucene90.Lucene90DocValuesConsumer.writeValues(Lucene90DocValuesConsumer.java:188)
                at org.apache.lucene.codecs.lucene90.Lucene90DocValuesConsumer.addNumericField(Lucene90DocValuesConsumer.java:133)
                at org.apache.lucene.tests.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addNumericField(AssertingDocValuesFormat.java:87)
                at org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:183)
                at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:142)
                at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:151)
                at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:205)
                at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:325)
                at org.apache.lucene.index.SegmentMerger.lambda$merge$1(SegmentMerger.java:155)
                at org.apache.lucene.search.TaskExecutor$TaskGroup.lambda$createTask$0(TaskExecutor.java:117)
                at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
                at org.apache.lucene.index.ConcurrentMergeScheduler$CachedExecutor.lambda$execute$0(ConcurrentMergeScheduler.java:982)
                at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
                at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
                at java.base/java.lang.Thread.run(Thread.java:1583)

Version and environment details

No response

benwtrent commented 2 months ago

OK, the NPE in sort, I did some manual debugging via good ole System.out.println. This only happens in the assertion if the cache check is greater than 1, which does seem to happen with intra-merge concurrency.

However, if I undo all the concurrency and printout the assertion check lines (that pass in without concurrency), I still see that sort: null. This tells me that this assertion has always had an issue with sort == null and not double checking it.

benwtrent commented 2 months ago

I think I know the issue with the parallel merging. This only happens when we use a SortingCodecReader.

The key issue is here: https://github.com/apache/lucene/commit/17c285d61743da0c06735e06235b20bd5aac4e14

This adjusted the caching as to where:

private <T> T getOrCreateNorms(String field, IOSupplier<T> supplier) throws IOException {
    return getOrCreate(field, true, supplier);
  }

private synchronized <T> T getOrCreate(String field, boolean norms, IOSupplier<T> supplier)
      throws IOException {
    if ((field.equals(cachedField) && cacheIsNorms == norms) == false) {
      assert assertCreatedOnlyOnce(field, norms);
      cachedObject = supplier.get();
      cachedField = field;
      cacheIsNorms = norms;
    }
    assert cachedObject != null;
    return (T) cachedObject;
  }

  private <T> T getOrCreateDV(String field, IOSupplier<T> supplier) throws IOException {
    return getOrCreate(field, false, supplier);
  }

This will cause a weird race condition as when merging norms will call getOrCreateNorms and in parallel, we could be calling getOrCreateDV, either of which will overwrite the other, and then potentially double cache.

Parallel merging breaks these assumptions and could cause issues.

@iverase @jpountz I propose we remove intra-merging parallelism from norms, terms, and doc values and do a 9.11.1 release.

We can incrementally add those back in the future.

benwtrent commented 2 months ago

Parallel merging breaks these assumptions and could cause issues.

Well, the assumptions are that its only accessed once. But now in parallel merging, it could be re-cached any number of times as the norms fields vs. the dv fields keep kicking eachother out of cache.

Seems like a bad idea regardless and I would like us to spend time making sure this is OK.

The proposal for a 9.11.1 is out of caution here.

jpountz commented 2 months ago

Disabling concurrent merging for terms, norms and doc values until we figure out how to make it compatible with SortingCodecReader sounds good to me.