apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

If IndexWriter is interrupted on close and is using a channel (mmap/nio), it can throw a ClosedByInterruptException and prevent you from opening a new IndexWriter in the same process if you are using native locks. [LUCENE-4638] #5703

Open asfimport opened 11 years ago

asfimport commented 11 years ago

The ClosedByInterruptException will prevent the index from being unlocked in close. If you try to close again, the call will hang. If you are using native locks and try to open a new IndexWriter, it will fail to get the lock. If you try IW#forceUnlock, it won't work because the IW that was not fully closed still holds the lock.
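For readers unfamiliar with the underlying JDK behavior, here is a minimal sketch using only `java.nio`, not Lucene. It shows that a write on a `FileChannel` from an interrupted thread fails with ClosedByInterruptException and leaves the channel closed. The class and method names (`InterruptedChannelDemo`, `interruptClosesChannel`) are made up for illustration, and setting the interrupt flag by hand is an assumption standing in for an interrupt arriving during IndexWriter close.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class InterruptedChannelDemo {

    /**
     * Writes to a FileChannel while the calling thread's interrupt flag is set.
     * Returns true if the write failed with ClosedByInterruptException and the
     * channel was closed as a side effect.
     */
    static boolean interruptClosesChannel() throws IOException {
        Path file = Files.createTempFile("interrupt-demo", ".bin");
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
            // Assumption: simulate an interrupt that arrives while I/O is in progress.
            Thread.currentThread().interrupt();
            try {
                channel.write(ByteBuffer.wrap(new byte[] {1, 2, 3}));
                return false; // the write unexpectedly succeeded
            } catch (ClosedByInterruptException e) {
                // The JDK closes the channel before throwing.
                return !channel.isOpen();
            } finally {
                // Clear the interrupt flag so later cleanup is not affected.
                Thread.interrupted();
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("channel closed by interrupt: " + interruptClosesChannel());
    }
}
```

Any code that keeps using the same channel after this point will fail, which is consistent with a close that cannot finish its remaining work.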

ideas:


Migrated from LUCENE-4638 by Mark Miller (@markrmiller), 2 votes, updated May 09 2016 Attachments: LUCENE-4638.patch Linked issues:

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hmm why does the 2nd call to close hang? Do you have the original exc?

IW.rollback() should do a "better job" closing and releasing the lock, and in general on getting an exception from IW.close I think it's the only real recourse you have (ie, it's hard to know what docs you lost due to that exception).
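The recovery path suggested here can be sketched roughly as follows. This is an illustration, not Lucene code: `RollbackableWriter` and `closeOrRollback` are hypothetical names, and the interface only stands in for the `close()` and `rollback()` methods of IndexWriter.

```java
import java.io.IOException;

class CloseOrRollbackDemo {

    /** Hypothetical stand-in for the two IndexWriter methods discussed in this thread. */
    interface RollbackableWriter {
        void close() throws IOException;
        void rollback() throws IOException;
    }

    /**
     * Tries a normal close; if that throws, falls back to rollback so the
     * writer can still release its resources. Returns true if close succeeded.
     */
    static boolean closeOrRollback(RollbackableWriter writer) throws IOException {
        try {
            writer.close();
            return true;
        } catch (IOException closeFailure) {
            try {
                writer.rollback(); // uncommitted changes are discarded
            } catch (IOException rollbackFailure) {
                // Keep both failures visible to the caller.
                closeFailure.addSuppressed(rollbackFailure);
                throw closeFailure;
            }
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder calls = new StringBuilder();
        // Fake writer whose close always fails, to exercise the fallback.
        RollbackableWriter failing = new RollbackableWriter() {
            public void close() throws IOException {
                calls.append("close;");
                throw new IOException("simulated failure during close");
            }
            public void rollback() {
                calls.append("rollback;");
            }
        };
        boolean closedCleanly = closeOrRollback(failing);
        System.out.println("closed cleanly: " + closedCleanly + ", calls: " + calls);
    }
}
```

One caveat: per the JDK documentation, a thread that receives ClosedByInterruptException still has its interrupt status set, so a fallback that performs more channel I/O on the same thread may fail the same way unless the flag is cleared first.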

Also, I think #5316 (IW.close should "just close", not wait for merges, commit, etc.) would improve this situation because then close would reliably release the lock.

asfimport commented 11 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I should be able to find what it was hanging on, but there are a lot of logs to look back through. It will probably be easier to reproduce it when I get home tonight. If I remember right, it was trying to open an mmap input or something similar, and it was just blocking. I'll reproduce and report the exact details.

asfimport commented 11 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I have not had a chance to duplicate the hang yet - using fullmetaljenkins to work on some other bugs. I really could use a resolution to this though.

Currently, the advice for cleaning up after an IndexWriter in the javadoc is broken with native locks. You can't necessarily call close twice and you can't unlock using the static unlock method.

Here is a patch that provides a way for users to use the unlock in a finally pattern that is safe for native locks.

It adds a forceUnlock method to IndexWriter that is not static.
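To illustrate why a lock that is still held inside the JVM blocks a second attempt, here is a small sketch using plain `java.nio` file locks. It is not the attached patch and does not use Lucene's lock factories. `NativeLockDemo`, `canLock`, and `run` are hypothetical names, and the temp file stands in for the index write lock.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class NativeLockDemo {

    /** Tries to take an exclusive lock on the file; returns false if it is already held. */
    static boolean canLock(Path lockFile) throws IOException {
        try (FileChannel channel = FileChannel.open(lockFile, StandardOpenOption.WRITE)) {
            FileLock lock = channel.tryLock();
            if (lock == null) {
                return false; // held by another process
            }
            lock.release();
            return true;
        } catch (OverlappingFileLockException e) {
            return false; // held elsewhere in this JVM
        }
    }

    /** Returns {lockable while the first holder keeps its lock, lockable after it releases}. */
    static boolean[] run() throws IOException {
        Path lockFile = Files.createTempFile("write", ".lock");
        try (FileChannel holder = FileChannel.open(lockFile, StandardOpenOption.WRITE)) {
            FileLock held = holder.tryLock();   // stands in for the first writer's lock
            boolean whileHeld = canLock(lockFile);
            held.release();                     // what a completed close would do
            boolean afterRelease = canLock(lockFile);
            return new boolean[] {whileHeld, afterRelease};
        } finally {
            Files.deleteIfExists(lockFile);
        }
    }

    public static void main(String[] args) throws IOException {
        boolean[] result = run();
        System.out.println("lockable while held: " + result[0]);
        System.out.println("lockable after release: " + result[1]);
    }
}
```

The point of the sketch is that only the object holding the `FileLock` can release it, which is why an unlock that goes through the writer's own lock instance behaves differently from one that tries to acquire the lock afresh.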

asfimport commented 11 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Probably want to put a != null check around the writeLock.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I don't think we should rush a fix here.

Let's see if rollback would have fixed it (and really javadocs should state that as the "recovery" if you hit exc during close), and let's understand what was hanging in the 2nd call to IW.close.

I think a new forceUnlock method in IndexWriter is too dangerous because the IndexWriter technically is still open, so the app can continue to do ops after releasing the lock.

asfimport commented 11 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> I think a new forceUnlock method in IndexWriter is too dangerous

It's the same as the current static unlock method and has the same javadoc.

I'm okay with it not being in Lucene though. I figure users would like to avoid this bug as well, but simply making the lock factory protected exposes it in an advanced enough way that it couldn't be considered dangerous. That would let me get rid of this bug as well.

> Let's see if rollback would have fixed it

I'll try that.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> simply making the lock factory protected exposes it in an advanced enough way that it couldn't be considered dangerous. That would let me get rid of this bug as well.

I think that's a good solution for this issue?

It would still be nice to know if rollback resolves it (it's supposed to!), and why the 2nd IW.close() hangs (which is weird).

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

If that's what gets committed, please keep the issue open in that case.

This kind of behavior in close is outright buggy. It's because it's doing too much.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Also, I would view it as a temporary solution, like until we have time to unfuck close() to not do so much.

I don't care that the issue is controversial. It's time to bring this to a head. I'm good at that.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Just a start, so there is no wimpy solution committed permanently because close() does too much. I don't want 4.1 released with that solution.

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[trunk commit] Mark Robert Miller http://svn.apache.org/viewvc?view=revision&revision=1425561

LUCENE-4638, SOLR-3180: try using the IW's writeLock to unlock

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[branch_4x commit] Mark Robert Miller http://svn.apache.org/viewvc?view=revision&revision=1425563

LUCENE-4638, SOLR-3180: try using the IW's writeLock to unlock

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[trunk commit] Mark Robert Miller http://svn.apache.org/viewvc?view=revision&revision=1425574

LUCENE-4638, SOLR-3180: revert for now (try using the IW's writeLock to unlock)

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[branch_4x commit] Mark Robert Miller http://svn.apache.org/viewvc?view=revision&revision=1425576

LUCENE-4638, SOLR-3180: revert for now (try using the IW's writeLock to unlock)

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.

asfimport commented 9 years ago

Scott Blum (@dragonsinth) (migrated from JIRA)

Any update on this? In our mostly-stock 5.2.1 Solr deployment, we are hitting a point where we get cores into a permanently wedged state all the time, and there seems to be no fix except to restart the entire node (JVM). The IndexWriter gets into a broken state with ClosedByInterrupt, and it never gets out of it, and no new IndexWriter (maybe also no new searchers) can be created. This is one of our biggest operational issues right now.