cloudant-labs / clouseau

Expose Lucene features as an erlang-like node
Apache License 2.0

fix(`IndexWriter`): work around resource leak on failure of `rollback()` #73

Closed pgj closed 1 year ago

pgj commented 1 year ago

The rollback() method of Lucene's IndexWriter is prone to leaking resources when it is hit by an exception (see LUCENE-5544). This causes locks to be left behind and makes subsequent queries fail for the affected index until restart.

The bug has been hiding here for a while: it was fixed upstream in 2014, in Lucene 4.7.1. It recently resurfaced when the upgrade to OTP 25 was tested on Apple Silicon (ARM64), and there is a worry that it may show up in production later.
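To illustrate what is at stake (a minimal, hypothetical sketch against the Lucene 4.6 API, not code from this repository): once rollback() throws without releasing write.lock, later attempts to open a writer on that directory fail until the JVM is restarted.

import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.index.{IndexWriter, IndexWriterConfig}
import org.apache.lucene.store.{FSDirectory, LockObtainFailedException}
import org.apache.lucene.util.Version

object RollbackLeakSketch {
  def main(args: Array[String]): Unit = {
    val dir = FSDirectory.open(new java.io.File("target/leaky-index"))
    // fresh config per writer, since an IndexWriterConfig cannot be shared in 4.x
    def config = new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46))

    val writer = new IndexWriter(dir, config)
    try writer.rollback() // LUCENE-5544: if this throws, write.lock is never released
    catch { case _: java.io.IOException => () }

    // With the leak, this keeps failing for the index until the process restarts.
    try new IndexWriter(dir, config).close()
    catch { case e: LockObtainFailedException => println("index is still locked: " + e.getMessage) }
  }
}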

For the record, the bug could be triggered by running CouchDB's Mango integration test suite in the following way:

make mango-test \
  MANGO_TEST_OPTS="10-disable-array-length-field-test.DisableIndexArrayLengthsTest.test_enable_index_array_length"

Per @rnewson's suggestion, replace rollback() with close() to work around this problem. The rollback() method is faster/cheaper, but close() does not seem to have problems with releasing the locks.
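The shape of the change, sketched here with simplified names (the real code is in IndexService.exit and uses clouseau's own logger):

import java.io.IOException
import org.apache.lucene.index.IndexWriter

object CloseInsteadOfRollback {
  def closeWriter(writer: IndexWriter): Unit =
    try {
      // was: writer.rollback() -- faster, but may leak write.lock (LUCENE-5544)
      writer.close() // implicitly commits, but releases the lock reliably
    } catch {
      case e: IOException =>
        System.err.println("Error while closing writer: " + e.getMessage)
    }
}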

jaydoane commented 1 year ago

I ran the mango test suite against a 3-node cluster and saw this error:

[actor:15] WARN clouseau - shards/00000000-ffffffff/mango_test_86364a90e970471c9e330828b69e0d25.1684347442/86386af6145397cbd3ad585a23380795 Error while closing writer
java.io.FileNotFoundException: _0.cfe
    at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:267)
    at org.apache.lucene.index.SegmentCommitInfo.sizeInBytes(SegmentCommitInfo.java:141)
    at org.apache.lucene.index.MergePolicy.size(MergePolicy.java:513)
    at org.apache.lucene.index.TieredMergePolicy.findMerges(TieredMergePolicy.java:285)
    at org.apache.lucene.index.IndexWriter.updatePendingMerges(IndexWriter.java:1965)
    at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1935)
    at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3059)
    at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:987)
    at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:931)
    at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:893)
    at com.cloudant.clouseau.IndexService.exit(IndexService.scala:222)
    at scalang.Process.trapExit(Process.scala:58)
    at scalang.Process.handleExit(Process.scala:46)
    at scalang.node.ProcessHolder$$anon$2.onMessage(ProcessAdapter.scala:55)
    at scalang.node.ProcessHolder$$anon$2.onMessage(ProcessAdapter.scala:52)
    at org.jetlang.channels.ChannelSubscription$1.run(ChannelSubscription.java:31)
    at org.jetlang.core.BatchExecutorImpl.execute(BatchExecutorImpl.java:11)
    at org.jetlang.fibers.PoolFiber.flush(PoolFiber.java:82)
    at org.jetlang.fibers.PoolFiber.access$100(PoolFiber.java:21)
    at org.jetlang.fibers.PoolFiber$1.run(PoolFiber.java:38)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
    at java.lang.Thread.run(Thread.java:748)
    at overlock.threadpool.ErrorLoggedThread.run(NamedThreadFactory.scala:40)
[actor:15] ERROR scalang.node.ServiceLauncher - An error occurred during handleExit in actor scalang.node.ProcessHolder$$anon$2@3abbbb74
org.apache.lucene.store.LockReleaseFailedException: Cannot forcefully unlock a NativeFSLock which is held by another indexer component: /Users/jay/repos/clouseau/target/clouseau3/shards/00000000-ffffffff/mango_test_86364a90e970471c9e330828b69e0d25.1684347442/86386af6145397cbd3ad585a23380795/write.lock
    at org.apache.lucene.store.NativeFSLock.release(NativeFSLockFactory.java:295)
    at org.apache.lucene.index.IndexWriter.unlock(IndexWriter.java:4472)
    at com.cloudant.clouseau.IndexService.exit(IndexService.scala:229)
    at scalang.Process.trapExit(Process.scala:58)
    at scalang.Process.handleExit(Process.scala:46)
    at scalang.node.ProcessHolder$$anon$2.onMessage(ProcessAdapter.scala:55)
    at scalang.node.ProcessHolder$$anon$2.onMessage(ProcessAdapter.scala:52)
    at org.jetlang.channels.ChannelSubscription$1.run(ChannelSubscription.java:31)
    at org.jetlang.core.BatchExecutorImpl.execute(BatchExecutorImpl.java:11)
    at org.jetlang.fibers.PoolFiber.flush(PoolFiber.java:82)
    at org.jetlang.fibers.PoolFiber.access$100(PoolFiber.java:21)
    at org.jetlang.fibers.PoolFiber$1.run(PoolFiber.java:38)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
    at java.lang.Thread.run(Thread.java:748)
    at overlock.threadpool.ErrorLoggedThread.run(NamedThreadFactory.scala:40)
[actor:1] INFO clouseau.cleanup - Removing shards/00000000-ffffffff/mango_test_86364a90e970471c9e330828b69e0d25.1684347442
[actor:4] INFO clouseau.cleanup - Removing shards/00000000-ffffffff/mango_test_86364a90e970471c9e330828b69e0d25.1684347442
[actor:17] INFO clouseau.cleanup - Removing shards/00000000-ffffffff/mango_test_4ddf5ce40c2542c0a61961c4f97ca77d.1684347442
[actor:19] INFO clouseau.cleanup - Removing shards/00000000-ffffffff/mango_test_4ddf5ce40c2542c0a61961c4f97ca77d.1684347442
[actor:3] WARN clouseau - shards/00000000-ffffffff/mango_test_4ddf5ce40c2542c0a61961c4f97ca77d.1684347442/86386af6145397cbd3ad585a23380795 Error while closing writer
java.io.FileNotFoundException: _0.cfs
    at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:267)
    at org.apache.lucene.index.SegmentCommitInfo.sizeInBytes(SegmentCommitInfo.java:141)
    at org.apache.lucene.index.MergePolicy.size(MergePolicy.java:513)
    at org.apache.lucene.index.TieredMergePolicy.findMerges(TieredMergePolicy.java:285)
    at org.apache.lucene.index.IndexWriter.updatePendingMerges(IndexWriter.java:1965)
    at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1935)
    at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3059)
    at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:987)
    at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:931)
    at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:893)
    at com.cloudant.clouseau.IndexService.exit(IndexService.scala:222)
    at scalang.Process.trapExit(Process.scala:58)
    at scalang.Process.handleExit(Process.scala:46)
    at scalang.node.ProcessHolder$$anon$2.onMessage(ProcessAdapter.scala:55)
    at scalang.node.ProcessHolder$$anon$2.onMessage(ProcessAdapter.scala:52)
    at org.jetlang.channels.ChannelSubscription$1.run(ChannelSubscription.java:31)
    at org.jetlang.core.BatchExecutorImpl.execute(BatchExecutorImpl.java:11)
    at org.jetlang.fibers.PoolFiber.flush(PoolFiber.java:82)
    at org.jetlang.fibers.PoolFiber.access$100(PoolFiber.java:21)
    at org.jetlang.fibers.PoolFiber$1.run(PoolFiber.java:38)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
    at java.lang.Thread.run(Thread.java:748)
    at overlock.threadpool.ErrorLoggedThread.run(NamedThreadFactory.scala:40)

I wanted to make sure I had updated my local clouseau correctly, and IndexService.scala:222 does indeed call close() instead of rollback().

So far, it does not seem difficult to reproduce.

pgj commented 1 year ago

Looks like close() is not an option and only rollback() can work?

pgj commented 1 year ago

With a 3-node cluster I can also see 500s but nothing shows up in the Clouseau logs when the crash happens.

rnewson commented 1 year ago

I think Jay's new error comes from trying to commit an empty index (close() implicitly commits). That throws an exception, and then we try to forcibly unlock the index while still holding the native lock (because close() did not complete).

What a mess!

I think the only sensible path forward is to upgrade Lucene. I would go straight to 4.10.4, but not further.
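For reference, a rough reconstruction of the failing sequence, assuming a teardown path shaped roughly like this (simplified, not the literal IndexService code):

import java.io.IOException
import org.apache.lucene.index.IndexWriter

object ExitSequenceSketch {
  def exit(writer: IndexWriter): Unit =
    try {
      writer.close() // the implicit commit fails (FileNotFoundException during flush/merge)
    } catch {
      case e: IOException =>
        System.err.println("Error while closing writer: " + e.getMessage)
        // close() did not complete, so this JVM still holds write.lock;
        // the forced unlock then fails with LockReleaseFailedException.
        IndexWriter.unlock(writer.getDirectory)
    }
}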

pgj commented 1 year ago

I opened #74 for the Lucene 4.10.4 update.

pgj commented 1 year ago

I am closing this pull request in favor of #74.