Open benwtrent opened 2 months ago
OK, for the NPE in sort, I did some manual debugging via good ole System.out.println. This only happens in the assertion if the cache check is greater than 1, which does seem to happen with intra-merge concurrency.
However, if I undo all the concurrency and print out the assertion check lines (which pass without concurrency), I still see sort: null. This tells me that this assertion has always had an issue with sort == null and never double-checked it.
I think I know the issue with the parallel merging. This only happens when we use a SortingCodecReader.
The key issue is here: https://github.com/apache/lucene/commit/17c285d61743da0c06735e06235b20bd5aac4e14
This changed the caching to the following:
    private <T> T getOrCreateNorms(String field, IOSupplier<T> supplier) throws IOException {
      return getOrCreate(field, true, supplier);
    }

    private synchronized <T> T getOrCreate(String field, boolean norms, IOSupplier<T> supplier)
        throws IOException {
      if ((field.equals(cachedField) && cacheIsNorms == norms) == false) {
        assert assertCreatedOnlyOnce(field, norms);
        cachedObject = supplier.get();
        cachedField = field;
        cacheIsNorms = norms;
      }
      assert cachedObject != null;
      return (T) cachedObject;
    }

    private <T> T getOrCreateDV(String field, IOSupplier<T> supplier) throws IOException {
      return getOrCreate(field, false, supplier);
    }
This can cause a weird race condition: when merging, norms will call getOrCreateNorms and, in parallel, doc values could be calling getOrCreateDV. Either call will overwrite the other's cache entry, and the same entry can then potentially be created and cached twice.
Parallel merging breaks these assumptions and could cause issues.
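To make the eviction pattern concrete, here is a minimal, single-threaded sketch of a one-slot cache shaped like the getOrCreate method quoted above. It is not Lucene code: the class name SingleSlotCacheDemo, the createCount map, and the use of java.util.function.Supplier in place of IOSupplier are all assumptions made for illustration. It simulates the interleaving that parallel norms and doc values merging could produce, and counts how often each entry gets re-created.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for the single-entry cache discussed above; not Lucene code.
class SingleSlotCacheDemo {
  private String cachedField;
  private boolean cacheIsNorms;
  private Object cachedObject;

  // Counts how many times each (field, kind) entry was created.
  final Map<String, Integer> createCount = new HashMap<>();

  @SuppressWarnings("unchecked")
  synchronized <T> T getOrCreate(String field, boolean norms, Supplier<T> supplier) {
    if (!(field.equals(cachedField) && cacheIsNorms == norms)) {
      // Cache miss: whatever was in the single slot is evicted here.
      createCount.merge(field + (norms ? "/norms" : "/dv"), 1, Integer::sum);
      cachedObject = supplier.get();
      cachedField = field;
      cacheIsNorms = norms;
    }
    return (T) cachedObject;
  }

  public static void main(String[] args) {
    SingleSlotCacheDemo cache = new SingleSlotCacheDemo();
    // Simulate the interleaving that concurrent norms and doc values merging
    // could produce for the same field.
    for (int i = 0; i < 3; i++) {
      cache.getOrCreate("title", true, () -> "norms");
      cache.getOrCreate("title", false, () -> "dv");
    }
    // Each entry is created once per round instead of once overall.
    System.out.println(cache.createCount);
  }
}
```

With strictly sequential access (all norms calls for a field, then all doc values calls), each entry would be created once. With the alternating pattern, every call is a miss, which is the situation in which a created-only-once assertion would trip.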
@iverase @jpountz I propose we remove intra-merging parallelism from norms, terms, and doc values and do a 9.11.1 release.
We can incrementally add those back in the future.
> Parallel merging breaks these assumptions and could cause issues.
Well, the assumption is that it's only accessed once. But now, with parallel merging, it could be re-cached any number of times as the norms fields and the DV fields keep kicking each other out of the cache.
Seems like a bad idea regardless and I would like us to spend time making sure this is OK.
The proposal for a 9.11.1 is out of caution here.
Disabling concurrent merging for terms, norms and doc values until we figure out how to make it compatible with SortingCodecReader sounds good to me.
Description
It was noticed that the CMS intra-merge behavior was not fully tested. To improve the test coverage, a change that overrides when the intra-merge scheduler is used has been drafted: https://github.com/apache/lucene/pull/13475
This PR has exposed many existing assertions that may all be weird testing failures. But a couple are more worrisome.
In particular:
Will periodically fail with an NPE on the sort:
Version and environment details
No response