apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Strange Segment State after encountering JDK22 bug #13275

Closed: benwtrent closed this issue 5 months ago

benwtrent commented 5 months ago

Description

There is a nasty bug in JDK 22:

Elasticsearch issue: https://github.com/elastic/elasticsearch/issues/106987
JDK issue: https://bugs.openjdk.org/browse/JDK-8329528

This JDK bug is not present in JDK 21.0.x: https://github.com/elastic/elasticsearch/issues/106987#issuecomment-2039591571 (it appears to be a regression starting with JDK 22).

We noticed in places where this bug reared its ugly head, segment state ended up in a strange place.

Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="<DATA_PATH>/segments_qt")))
Caused by: java.nio.file.NoSuchFileException: <DATAPATH>/_9xo.si

The resulting segment state did include the segments_N file and a CFS file, but was missing the .si and .cfe files.

This smells like over-eager cleanup code somewhere: the rename from pending_* seems to have occurred and the CFS file was created, but the .si and .cfe files were deleted.

Version and environment details

JDK 22

Lucene 9.10

uschindler commented 5 months ago

Are we sure this is all the same bug in JDK 22?

The bugs linked above are about FillerArray, while this issue is about strange segment states. How do they fit together?

DaveCTurner commented 5 months ago

All we have right now is a correlation: we've seen quite a few nodes fall over with that JVM bug, and some subset of them have come back up and discovered a missing .si (and .cfe) file. Ben and I have spent some time going through the IndexWriter#commit process without finding anything obvious, but we wonder if the JVM bug is throwing an exception in a place where we're assuming no exception could possibly be thrown.

DaveCTurner commented 5 months ago

In particular I'm a little suspicious about the way we call org.apache.lucene.index.IndexFileDeleter#incRef(org.apache.lucene.index.SegmentInfos, boolean) after doing the rename and setting commitCompleted = true, all within a try (Closeable finalizer = () -> deleter.decRef(commitFiles)) {...} block. Are we sure it's ok if an exception is thrown just before that incRef() call? Is there something else preventing those files from being decRef'd into the void? Sorry if this is a red herring, I don't spend much time in this code so I don't have all the relevant invariants in my head.
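For readers less familiar with that code, the pattern in question looks roughly like this. This is a minimal sketch with illustrative names, not the actual IndexWriter/IndexFileDeleter code, assuming a ref-counting deleter with incRef/decRef:

```java
import java.io.Closeable;
import java.util.List;

// Hypothetical sketch of the commit pattern described above; names and
// structure are illustrative, not the real IndexWriter code.
class CommitSketch {
    interface Deleter {
        void incRef(List<String> files); // bump ref counts; may throw
        void decRef(List<String> files); // drop ref counts; deletes files that reach zero
    }

    boolean commitCompleted = false;

    void finishCommit(Deleter deleter, List<String> commitFiles) throws Exception {
        // decRef is guaranteed to run when this block exits, even on exception
        try (Closeable finalizer = () -> deleter.decRef(commitFiles)) {
            renamePendingSegmentsFile(); // pending_segments_N -> segments_N
            commitCompleted = true;
            // If an exception (JVM bug, OOME, ...) is thrown right here, the
            // finalizer still decRef's commitFiles even though the matching
            // incRef below never happened -- the concern raised above.
            deleter.incRef(commitFiles);
        }
    }

    void renamePendingSegmentsFile() {
        // placeholder for the directory rename step
    }
}
```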

uschindler commented 5 months ago

Now it makes sense: the broken segments look like a side effect of the JVM bug, which brings IndexWriter into a broken state so that it deletes files even though the commit has not fully applied.

benwtrent commented 5 months ago

I was able to replicate the weird behavior by randomly failing an incRef for a given file. The JDK bug could occur during a simple iteration, meaning, we could increment some but not all files. This is obviously a pretty abusive and manual replication. I am not 100% sure this is exactly what is happening.

I have added random exceptions in each step of the index pre-commit and commit process and haven't been able to replicate it outside of forcibly failing a file incRef.
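As a toy illustration of the failure mode being simulated (hypothetical names, not Lucene's FileDeleter): if incRef throws partway through iterating the commit files, some files gain a reference and others do not, so a later blanket decRef can drop the un-incRef'd files to zero and delete them.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Toy ref-counter illustrating a partial incRef followed by a blanket decRef;
// this is a hypothetical stand-in, not Lucene's FileDeleter.
class RefCountSketch {
    final Map<String, Integer> refCounts = new HashMap<>();
    final Random random = new Random();

    void incRef(List<String> files) {
        for (String file : files) {
            if (random.nextInt(10) == 0) {
                // simulate an unexpected error striking mid-iteration,
                // leaving the remaining files without the extra reference
                throw new IllegalStateException("simulated failure while incRef'ing " + file);
            }
            refCounts.merge(file, 1, Integer::sum);
        }
    }

    void decRef(List<String> files) {
        for (String file : files) {
            int count = refCounts.merge(file, -1, Integer::sum);
            if (count <= 0) {
                refCounts.remove(file);
                System.out.println("deleting " + file); // file hit zero refs and would be deleted
            }
        }
    }
}
```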

rmuir commented 5 months ago

If the bug is in G1, just use a different collector to work around it?

benwtrent commented 5 months ago

@rmuir for sure, that is an option.

I opened this issue because it seems like Lucene isn't rolling back correctly, and that struck me as problematic. Ending up in an unreadable state due to a random exception is regrettable, even if the JDK is doing something bonkers. I realize this could be 'best effort' when the JDK becomes nondeterministic, so I wanted to see if Lucene's effort here could be better.

I have been testing this by randomly throwing at IndexWriter#testPoint. I added some more test points to simulate failures throughout the process. During this testing I found other issues (double file decrements for example), but haven't been able to fully replicate the weird missing file failure besides abusing the FileDeleter#incRef method.
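For illustration, one common failure-injection hook in the Lucene test framework is MockDirectoryWrapper.failOn, which lets a test throw from inside directory operations. The sketch below is an assumption about how such a test could be wired up; it is not the extra test points described above, just the general technique of injecting random exceptions while a commit is in flight.

```java
import java.io.IOException;
import java.util.Random;
import org.apache.lucene.tests.store.MockDirectoryWrapper;

// Sketch: occasionally throw from directory operations while IndexWriter is
// somewhere inside its commit path. Illustrative only.
class CommitFailureInjection {
    static void installFailure(MockDirectoryWrapper dir, Random random) {
        dir.failOn(new MockDirectoryWrapper.Failure() {
            @Override
            public void eval(MockDirectoryWrapper dir) throws IOException {
                for (StackTraceElement frame : Thread.currentThread().getStackTrace()) {
                    if (frame.getClassName().equals("org.apache.lucene.index.IndexWriter")
                            && frame.getMethodName().toLowerCase().contains("commit")
                            && random.nextInt(20) == 0) {
                        throw new IOException("simulated exception during commit");
                    }
                }
            }
        });
    }
}
```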

rmuir commented 5 months ago

Only thing lucene could do is to not do this reference counting with java code in indexwriter. Can't make java code more complicated to deal with issues like this; if the jvm is broke then it's broke.

So the only other option is to not count in java; that way jvm corruption can't hurt it. That means using dup2() or similar and letting the kernel take care of it.

DaveCTurner commented 5 months ago

You could get an exception in these places even if there were no JVM bug; for instance, an OOME can be thrown basically anywhere. The JVM still carries on running finally blocks and other stuff after hitting an OOME (unless you tell it not to, but most folks apparently don't). Not that I think this is in any way a sensible default behaviour, but it's not something I think we can change.
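To make that concrete, here is a minimal, self-contained illustration (with an artificially constructed OutOfMemoryError rather than a real allocation failure) showing that finally blocks still execute while the error propagates:

```java
// Minimal demo: a finally block still runs after an OutOfMemoryError is thrown.
// The error is constructed artificially here purely for demonstration.
public class OomeFinallyDemo {
    public static void main(String[] args) {
        try {
            try {
                throw new OutOfMemoryError("simulated");
            } finally {
                // Still executes while the OOME is propagating.
                System.out.println("finally block ran after OOME");
            }
        } catch (OutOfMemoryError e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

The "unless you tell it not to" part refers to flags like HotSpot's -XX:+ExitOnOutOfMemoryError, which terminates the JVM instead of letting it limp on.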

benwtrent commented 5 months ago

@rmuir for sure, I am not suggesting one solution or the other. Just that the only way I could reproduce was abusing the code in horrific ways. Which is a testament to the general resiliency in Lucene.

I am still trying to replicate it in some way that isn't a "break the incref" scenario.

uschindler commented 5 months ago

Basically we can't do anything if, due to broken JVM code, the finally block is not called. That can only happen with JVM bugs, and then we're lost.

rmuir commented 5 months ago

This is why IndexWriter closes itself on OOM.

rmuir commented 5 months ago

and really, that safety mechanism is probably the easiest way to defend against this one. Problem is that only VirtualMachineError is considered a tragedy, and IncompatibleClassChangeError isn't a subclass of that.

If you extend tragic handling to VirtualMachineError | LinkageError, then indexwriter won't delete files after this hits.
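A minimal sketch of the kind of check being proposed (illustrative, not necessarily the exact code in the PR referenced below):

```java
// Sketch of broadened "tragedy" handling: treat LinkageError (which includes
// IncompatibleClassChangeError) like VirtualMachineError, so IndexWriter
// shuts down instead of continuing to delete files after such an error.
final class TragedyCheck {
    static boolean isTragedy(Throwable t) {
        return t instanceof VirtualMachineError || t instanceof LinkageError;
    }
}
```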

rmuir commented 5 months ago

Please see PR #13277.

I think we should be more defensive here.