Closed benwtrent closed 5 months ago
Are we sure this is all the same bug in JDK 22?
The above bugs are about FillerArray and the other one about strange segment states. How does this fit together?
All we have right now is a correlation: we've seen quite a few nodes fall over with that JVM bug, and some subset of them have come back up and discovered a missing .si
(and .cfe
) file. Ben and I have spent some time going through the IndexWriter#commit
process without finding anything obvious, but we wonder if the JVM bug is throwing an exception in a place where we're assuming no exception could possibly be thrown.
In particular I'm a little suspicious about the way we call org.apache.lucene.index.IndexFileDeleter#incRef(org.apache.lucene.index.SegmentInfos, boolean)
after doing the rename and setting commitCompleted = true
, all within a try (Closeable finalizer = () -> deleter.decRef(commitFiles)) {...}
block. Are we sure it's ok if an exception is thrown just before that incRef()
call? Is there something else preventing those files from being decRef
'd into the void? Sorry if this is a red herring, I don't spend much time in this code so I don't have all the relevant invariants in my head.
Now it makes sense: The broken segments look like a side-effect of the JVM bug because it brings IndexWriter into some broken state so it deletes the files without the commit fully applied.
I was able to replicate the weird behavior by randomly failing an incRef
for a given file. The JDK bug could occur during a simple iteration, meaning, we could increment some but not all files. This is obviously a pretty abusive and manual replication. I am not 100% sure this is exactly what is happening.
I have added random exceptions in each step of the index pre-commit and commit process and haven't been able to replicate it outside of forcibly failing a file incRef
.
if the bug is in G1, just use a different collector to workaround it?
@rmuir for sure, that is an option.
I opened this issue as it seems like Lucene isn't rolling back correctly and it struck me as problematic. Ending up in an un-readable state due to a random exception is regrettable, even if the JDK is doing something bonkers. I realize this could be 'best effort' when the JDK becomes nondeterministic, so I wanted to see if Lucene's effort here could be better.
I have been testing this by randomly throwing at IndexWriter#testPoint
. I added some more test points to simulate failures throughout the process. During this testing I found other issues (double file decrements for example), but haven't been able to fully replicate the weird missing file failure besides abusing the FileDeleter#incRef
method.
Only thing lucene could do is to not do this reference counting with java code in indexwriter. Can't make java code more complicated to deal with issues like this, if the jvm is broke then it's broke.
So only other option is to not count in java, That way jvm corruption can't hurt it. Means using dup2() or similar and letting the kernel take care.
You could get an exception in these places even if there were no JVM bug, for instance an OOME can be thrown basically anywhere. The JVM still carries on running finally
blocks and other stuff after hitting an OOME (unless you tell it not to, but most folks apparently don't). Not that I think this is in any way a sensible default behaviour or anything, but still it's not something I think we can change.
@rmuir for sure, I am not suggesting one solution or the other. Just that the only way I could reproduce was abusing the code in horrific ways. Which is a testament to the general resiliency in Lucene.
I am still trying to replicate it in some way that isn't a "break the incref" scenario.
Basically we can't do anything if due to broken JVM code, the finally block is not called. This can happen on JVM bugs only and then we're lost.
This is why indexwriter closes itself on oom.
and really, that safety mechanism is probably the easiest way to defend against this one. Problem is that only VirtualMachineError
is considered a tragedy, and IncompatibleClassChangeError
isn't a subclass of that.
If you extend tragic handling to VirtualMachineError | LinkageError
, then indexwriter won't delete files after this hits.
Please, see PR: #13277
I think we should be more defensive here.
Description
There is a nasty bug in JDK 22:
Elasticsearch issue: https://github.com/elastic/elasticsearch/issues/106987 JDK issue: https://bugs.openjdk.org/browse/JDK-8329528
This JDK bug is not present in JDK 21.0.x: https://github.com/elastic/elasticsearch/issues/106987#issuecomment-2039591571 ( it appears to be a regression starting with JDK 22 )
We noticed in places where this bug reared its ugly head, segment state ended up in a strange place.
The resulting segment state did create the
segments_N
file, and had a CFS file. But, was missing.si
and.cfe
.This smells like we have an over-eager cleanup code somewhere. As the rename from
pending_*
seems to have occurred, and the CFS file created, but we deleted the.si
and.cfe
files.Version and environment details
JDK 22
Lucene 9.10