Merges should periodically check for abort [LUCENE-10658]

asfimport commented 2 years ago

Rolling back an IndexWriter without committing shouldn't take long (i.e., less than several seconds), and Elasticsearch cluster coordination relies on this assumption. If some merges are taking place, the rollback can take several minutes as merges only check for abort when writing to files via MergeRateLimiter. Merging a completion field, for example, can take a long time without touching output files. Another reason merges should periodically check for abort is its outputs will be discarded.

Migrated from LUCENE-10658 by Nhat Nguyen (@dnhatn), updated Jul 28 2022

asfimport commented 2 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1, merges should abort promptly. But it is indeed only a "best effort" mechanism.

I guess Lucene's completion field is building FSTs during merging and not writing bytes to disk as it builds the large FST, until the end?

Maybe there are other parts of Lucene merging that also fail to check promptly enough, e.g. maybe when dimensional points are doing a (large) offline sort before writing anything to the output files?

Maybe we could instrument MergeRateLimiter to write a WARNING into infoStream whenever too much time has elapsed between visits to its maybePause API? We could use that to tease out other places that are failing to write bytes frequently enough for abort checking.

Lucene used to check for merge abort deep inside IndexWriter and merging code (e.g. merging postings would check periodically, same for doc values, etc.), but I think we refactored that down to the rate limiter only in #8751 which was a nice cleanup / step forward.

asfimport commented 2 years ago

Vigya Sharma (@vigyasharma) (migrated from JIRA)

How about interrupting the merge threads when all merges are to be aborted, like in a rollback() request?

This would at least work for the concurrent merge scheduler, where we track all the active mergeThreads in a list, and can interrupt them, instead of waiting for them to hit a checkAborted() call. We would still need aborted checks though, to work with the sequential merge scheduler, but this could address a significant number of use cases.

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Thread#interrupt has the side effect on invalidating every NIOFSIndexInput that is blocked on I/O, the javadocs on NIOFSDirectory give some more details, so we generally discourage from using Thread#interrupt. I guess it would not be ok to interrupt because there might be open IndexReaders that reference some of the segments that are being merged and that users plan on keeping using after the call to rollback?

asfimport commented 2 years ago

Vigya Sharma (@vigyasharma) (migrated from JIRA)

Good point.. We use pooled readers and interrupting those threads could make the readers invalid. Thanks for sharing this Adrien!

Adding here for reference, per java docs:

Accessing this class either directly or indirectly from a thread while it's interrupted can close the underlying file descriptor immediately if at the same time the thread is blocked on IO. The file descriptor will remain closed and subsequent access to NIOFSDirectory will throw a ClosedChannelException.

apache / lucene

Merges should periodically check for abort [LUCENE-10658] #11694