cloudant-labs / clouseau

Expose Lucene features as an erlang-like node
Apache License 2.0

fix: race condition between index close and clean-up #77

Closed pgj closed 1 year ago

pgj commented 1 year ago

When a database is deleted, the corresponding directory structure is removed entirely, even if the index writer has not yet been closed. Due to the missing files, both the attempt to close the index and the subsequent rollback fail, which eventually leads to a leak of resources, more specifically locks, due to a known bug in Lucene versions prior to 4.7.1. As a result, subsequent search requests to the same index will fail once it is recreated.

The reason for this is that the clean-up service might be triggered before the index service is killed off, so the order of removal (first the lock, then the rest of the files) may become reversed. Adjust the services to handle this situation gracefully.

rnewson commented 1 year ago

copying from channel

1) handleInfo is also handled sequentially off the message queue, so it blocks just as much as before
2) casting to FSDirectory is inappropriate
3) Directory has a method to list files and a method to delete files, so we can delete through the abstraction (see the sketch below)
4) calling getDirectory.delete will fail if the directory is not empty, which will always be the case here?
5) if delete() fails we send the message again; it only terminates if delete returns true (which won't happen if the directory isn't empty)
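
For illustration, deleting through the Directory abstraction (point 3) could look roughly like the following sketch, assuming a Lucene 4.x Directory; the helper name is made up:

import org.apache.lucene.store.Directory

// Delete every file the index knows about via the Directory API itself,
// without assuming anything about how (or whether) it is backed by disk.
def deleteIndexFiles(dir: Directory): Unit = {
  for (name <- dir.listAll()) {
    dir.deleteFile(name)
  }
}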

rnewson commented 1 year ago

I think the _search_cleanup logic works (though we sadly appear not to call it from smoosh), so ... maybe this?

diff --git a/src/main/scala/com/cloudant/clouseau/IndexCleanupService.scala b/src/main/scala/com/cloudant/clouseau/IndexCleanupService.scala
index d3368c0..a32ca24 100644
--- a/src/main/scala/com/cloudant/clouseau/IndexCleanupService.scala
+++ b/src/main/scala/com/cloudant/clouseau/IndexCleanupService.scala
@@ -31,7 +31,8 @@ class IndexCleanupService(ctx: ServiceContext[ConfigurationArgs]) extends Servic
     case CleanupPathMsg(path: String) =>
       val dir = new File(rootDir, path)
       logger.info("Removing %s".format(path))
-      recursivelyDelete(dir, true)
+      val pattern = Pattern.compile(path + "/([0-9a-f]+)$")
+      cleanup(dir, pattern, Nil)
     case RenamePathMsg(dbName: String) =>
       val srcDir = new File(rootDir, dbName)
       val sdf = new SimpleDateFormat("yyyyMMdd'.'HHmmss")

pgj commented 1 year ago

Simply replacing recursivelyDelete() with cleanup() indeed does not trigger the bug, but the directories are not cleaned up either. That would only happen if closing the indexes were followed by the removal of the containing directory.

Note that there is no problem with deleting the files from the directory; Directory does a great job at that. But it does not cover the removal of the directory itself, and calling close() does not take care of that either. I do not see why the cast is inappropriate -- it is not done unconditionally, only when it is possible.

I believe that the (original) blanket recursivelyDelete() call on cleaning up the path is a different way of dealing with the lack of directory removal on the IndexService side. In that approach the task is delegated to them, and the path clean-up should only do the final touch by removing the top-level directory (the one named after the database). This top-level directory becomes (or should become) empty eventually.

What I am not sure about is whether every IndexService under the clean-up path is scheduled for removal by the time CleanupPathMsg is processed. Therefore I agree that cleanup() is a safer way to fix the problem, but it has to be complemented with directory deletions at the IndexService level.

rnewson commented 1 year ago

java.io.File#delete() is not recursive; it would only delete a single file or a single directory (if the directory was empty).

Anyway, agree we should try to delete any newly empty directories that deleting indexes will create, all the way up to clouseau.dir (but certainly not further).

My main worry there is a concurrent request coming in that creates a new index. It would create the directory again and then go on to make files in that directory (the write.lock for starters), but there's a gap there: the index creation would fail if cleanup had deleted the directory in the middle.

A significant mitigation here though is that even if a user deleted and recreated a database with search indexes as quickly as possible, the shard names, and thus the file paths and directory names, would be different, because of the shard suffix, which changes every millisecond.

So... if we're doing cleanup because the database was deleted (as opposed to being here because of _search_cleanup where we intend to remove any indexes not in 'activeSigs') I think it will be safe to attempt to delete the empty dirs from the path directory up to clouseau.dir directory.

rnewson commented 1 year ago

more succinctly, we can add a deleteEmptyDirs call after cleanup(dir, pattern, Nil) when handling the CleanupPathMsg message.
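
A minimal sketch of such a helper, assuming placeholder names and that it walks upwards from the cleaned-up path, stopping at the configured root (rootDir standing in for clouseau.dir):

import java.io.File

def deleteEmptyDirs(start: File, rootDir: File): Unit = {
  var current = start
  // walk towards the root, removing directories that have become empty,
  // and never touch the root itself
  while (current != null
    && current.isDirectory
    && current.getCanonicalFile != rootDir.getCanonicalFile
    && current.list() != null
    && current.list().isEmpty) {
    val parent = current.getParentFile
    current.delete() // non-recursive; only removes the now-empty directory
    current = parent
  }
}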

pgj commented 1 year ago

java.io.File#delete() is not recursive

Yes, I am aware that java.io.File#delete() is not recursive. That is why I introduced a wait until the top-level directory becomes empty so that even java.io.File#delete() can delete it.

My main worry [..] there's a gap there. the index creation would fail if cleanup had deleted the directory in the middle.

But I understand the concern: an empty top-level directory does not guarantee that no further indexes (i.e. sub-directories, possibly populated with lock files) will be created.

A significant mitigation here though is that even if a user deleted and recreated a database with search indexes as quickly as possible, the shard names, and thus the file paths and directory names, would be different, because of the shard suffix, which changes every millisecond.

Maybe the millisecond resolution is not enough any more? I frequently see multiple clean-up attempts with exactly the same path (same shard, same database, same index, same timestamp, though different actors), e.g.:

[actor:1] INFO clouseau.cleanup - Removing shards/00000000-ffffffff/mango_test_dbe8a260d91e4d5cb502105a51cca6d8.1686302963
[actor:4] INFO clouseau.cleanup - Removing shards/00000000-ffffffff/mango_test_dbe8a260d91e4d5cb502105a51cca6d8.1686302963
[actor:8] INFO clouseau.cleanup - Removing shards/00000000-ffffffff/mango_test_dbe8a260d91e4d5cb502105a51cca6d8.1686302963
[actor:0] INFO clouseau.cleanup - Removing shards/00000000-ffffffff/mango_test_dbe8a260d91e4d5cb502105a51cca6d8.1686302963
[actor:9] INFO clouseau.cleanup - Removing shards/00000000-ffffffff/mango_test_dbe8a260d91e4d5cb502105a51cca6d8.1686302963

more succinctly, we can add a deleteEmptyDirs call after cleanup(dir, pattern, Nil) when handling the CleanupPathMsg message.

Is this deleteEmptyDirs a function to be written that implements the named functionality, or is it available somewhere in the code or in the libraries?

rnewson commented 1 year ago

millisecond precision isn't enough for a test suite, no, but in production it is unlikely to occur.

I meant to say that deleteEmptyDirs is a function we would need to write.

pgj commented 1 year ago

I have added the recursive deletion of empty directories after the cleanup call. But I still have the feeling that it does not solve the problem fully -- it is not guaranteed that the database-level directories (with the timestamp suffix) will be removed. The index services may not stop and clean up after themselves in time, so the removal of directories should be retried later.

I changed the unit test to be more detailed about the removal of the directory structure in question. This is achieved by introducing a deeper structure, similar to the ones in real-world settings. (It is not published here yet.) The latest version I pushed (680236c) fails this updated test, as it may leave the database directory behind.

pgj commented 1 year ago

Pushed a (hopefully) enhanced version of the fix, along with the stricter unit test.

rnewson commented 1 year ago

I don't think this is an improvement, sorry. It appears to delete all the indexes whenever the IndexService processes exit, which they do regularly and which is quite normal.

Can we focus tightly on just fixing the race condition so we can avoid the deadlock on rollback()?

pgj commented 1 year ago

The delete operation invoked on the directory is not recursive, hence it will only succeed once the directory no longer contains any files. It can be viewed as a lazy clean-up once the index is no longer in use. (To some degree, I used this in my earlier sketches too.) In my understanding, when an index service is created it recreates the necessary directories in order to populate them with files.

That is needed because the main idea behind this recent proposal is to partially delegate the removal of the index directories, along with their parents, to the index services. This way, all access to the same set of index files and their eventual removal falls within the scope of a single process, which could eliminate the race. The task of the clean-up service is then to ask the corresponding index service to remove its directories, and to delete all remaining files and directories itself. The clean-up service would only have to take care of removing the top-level (database-level) directory at the end, but that is only possible if there are no more pending deletions.
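
A sketch of the lazy, non-recursive delete described above, with illustrative names (the assumption is that the index service would call something like this when it goes away):

import java.io.File

def deleteIfEmpty(indexDir: File): Unit = {
  // java.io.File#delete is not recursive, so this only succeeds once the
  // index files are gone; otherwise it just returns false and we move on
  if (indexDir.delete()) {
    // try the database-level parent as well; this only works once every
    // sibling index directory has already been removed
    Option(indexDir.getParentFile).foreach(_.delete())
  }
}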

rnewson commented 1 year ago

ok, I see. I still think this is too much. You won't be able to guarantee the empty directories are deleted this way no matter what. If clouseau is not running when a database is deleted, nothing here can find and remove those indexes. That is a problem, but it will need fixing elsewhere.

Given that, I don't think it's right to try to do too much here, I'd like to just see the smallest change that eliminates (or significantly reduces) the occurrences of hitting the rollback() deadlock bug in Lucene 4.6.1.

I especially don't want to see code conditional on dir being an FSDirectory. We should either change the variables to be that type (and not pretend that any Directory subclass is acceptable) or make it work for all of them.

I don't want to crush your enthusiasm; all of this work has been valuable to create (and review!). We must recognise that clouseau is quite crufty and cannot be "fixed" without significant effort, which I believe is happening elsewhere.

The true solution here is to be able to upgrade Lucene to a fixed version, but we just don't have the infrastructure for index migration (up and down) that makes that safe enough.

pgj commented 1 year ago

The explicit cast to FSDirectory is needed because IndexWriter#getDirectory() can only return Directory objects. However, if I understood the source code right, IndexWriters are instantiated with FSDirectory objects, so it should always work; there are no other possibilities.
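
For illustration, the "only when it is possible" cast could also be written as a pattern match, so the non-FSDirectory case is handled explicitly instead of failing (a sketch, not the code under review):

import java.io.File
import org.apache.lucene.store.{ Directory, FSDirectory }

def underlyingPath(dir: Directory): Option[File] = dir match {
  case fsDir: FSDirectory => Some(fsDir.getDirectory) // Lucene 4.x: returns java.io.File
  case _ => None // e.g. a RAMDirectory: nothing on disk to remove
}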

If we are looking for a way to reduce the chances of hitting the bug by making the smallest possible change, I would recommend increasing the resolution of the timestamp that is used for the shard suffixes in CouchDB. Currently, it changes only every second. Making it change every millisecond results in a different path when the database is recreated. Because of that, it is quite likely that a directory other than the previous one is going to be locked.

diff --git a/src/mem3/src/mem3.erl b/src/mem3/src/mem3.erl
index a57cad2db..cd4456188 100644
--- a/src/mem3/src/mem3.erl
+++ b/src/mem3/src/mem3.erl
@@ -464,7 +464,7 @@ db_is_current(Name) when is_binary(Name) ->
     couch_server:exists(Name).

 generate_shard_suffix() ->
-    UnixSeconds = os:system_time(second),
+    UnixSeconds = os:system_time(millisecond),
     "." ++ integer_to_list(UnixSeconds).

 strip_shard_suffix(DbName) when is_binary(DbName) ->

But I guess this is also something that we cannot do. My other idea for patching the issue is presented in #78. Maybe that one is more acceptable under the current circumstances, but it is a hack :-)

rnewson commented 1 year ago

hm, I thought the race that affected production was between the cleanup code (that deletes all the files in the index) and the IndexService process exiting and attempting to release the write.lock.

It must be much less frequent for a database with search indexes to be deleted and recreated with search indexes in under a second.

The granularity of generate_shard_suffix isn't great, is it? I think everything that might need to know it will pull it from mem3, but I wouldn't be surprised if some code matches on it based on its current length, so I don't think we can change it without some checking. But then I don't think increasing granularity here will help with the production issue, unless I've misunderstood that.

pgj commented 1 year ago

In my current understanding, the clean-up of the path and the termination of the index service happen solely in response to events from Dreyfus (CouchDB). There is a specific cleanup message for triggering the clean-up service, but I am not yet sure how exactly the termination is induced. However, note that it came up in relation to the switch to OTP 25, which may result in faster execution of the Erlang code.

The order of clean-up and termination might be the right one, but sometimes the clean-up removes the directory where the index files are located. Then IndexWriter#rollback() explodes on listing the contents and will not be able to release the lock due to the bug.

Excerpts from the logs below show that this happens when clean-ups and terminations occur in quick succession. Consider the regular case:

[actor:13] DEBUG clouseau - Creating shards/00000000-ffffffff/mango_test_6a3f5a2ac3b1490ebc1ed99671e443f2.1686766785/86386af6145397cbd3ad585a23380795
[actor:10] DEBUG clouseau - shards/00000000-ffffffff/mango_test_6a3f5a2ac3b1490ebc1ed99671e443f2.1686766785/86386af6145397cbd3ad585a23380795 Closed with reason: 'normal
[actor:12] INFO clouseau.cleanup - Removing shards/00000000-ffffffff/mango_test_6a3f5a2ac3b1490ebc1ed99671e443f2.1686766785

and the problematic case where an unexpected additional clean-up shows up:

[actor:4] INFO clouseau.cleanup - Removing shards/00000000-ffffffff/mango_test_2b4337aeffc048af94dce8e025e161aa.1686766812
[actor:3] DEBUG clouseau - Creating shards/00000000-ffffffff/mango_test_2b4337aeffc048af94dce8e025e161aa.1686766812/8ee79787c7c688ca999af5958b72bb73
[actor:15] INFO clouseau.cleanup - Removing shards/00000000-ffffffff/mango_test_2b4337aeffc048af94dce8e025e161aa.1686766812
[actor:14] DEBUG clouseau - shards/00000000-ffffffff/mango_test_2b4337aeffc048af94dce8e025e161aa.1686766812/8ee79787c7c688ca999af5958b72bb73 Closed with reason: 'normal
[actor:14] WARN clouseau - shards/00000000-ffffffff/mango_test_2b4337aeffc048af94dce8e025e161aa.1686766812/8ee79787c7c688ca999af5958b72bb73 Error while closing writer
org.apache.lucene.store.NoSuchDirectoryException: directory '/Users/gaborpali/projects/cloudant/clouseau/target/clouseau1/shards/00000000-ffffffff/mango_test_2b4337aeffc048af94dce8e025e161aa.1686766812/8ee79787c7c688ca999af5958b72bb73' does not exist

NativeFSLockFactory of Lucene 4.6 keeps track of all the locks in memory, and any further attempt to obtain the (now inaccessible) lock will fail. This also explains why restarting the service can make the problem go away. (Even if the files are all gone, the internal database of locks becomes out of sync with their state.)

In-memory locks are stored in a hash table keyed on the path. The problem surfaces when we try to lock the same path that previously failed to unlock. Recreating the same database with the same text index within the same second (i.e. on the same path) is one way to cause this.

Increasing the granularity of the shard suffix could only help to make it less likely that we are trying to obtain the lock on the same path. There would be no 500s for users on the CouchDB side; users would not notice anything. But locks on paths would still keep leaking every time the clean-up service wins, which may even lead to memory exhaustion on a larger scale...?
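
As a simplified model of the behaviour described above (not Lucene's actual code): the in-memory table of held lock paths behaves roughly like the following, so a lock whose release never ran keeps blocking any later obtain on the same path.

import scala.collection.mutable

object LockTableModel {
  private val held = mutable.Set.empty[String]

  // returns false if the path is already marked as locked
  def obtain(path: String): Boolean = held.synchronized(held.add(path))

  // never reached if closing the writer threw before the lock was released
  def release(path: String): Unit = held.synchronized { held.remove(path) }
}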

pgj commented 1 year ago

I think, in general, the extra clean-up on the same path should not be allowed to interfere with the termination. This would save us from the exception and hence from the overall problem of leaking locks.

pgj commented 1 year ago

Closing this in favor of #80.