AnalyzingInfixSuggester thread safety: lookup() fails during (re)build() [LUCENE-10292]

asfimport commented 2 years ago

I'm filing this based on anecdotal information from a Solr user w/o experiencing it first hand (and I don't have a test case to demonstrate it) but based on a reading of the code the underlying problem seems self evident...

With all other Lookup implementations I've examined, it is possible to call lookup() regardless of whether another thread is concurrently calling build() – in all cases I've seen, it is even possible to call lookup() even if build() has never been called: the result is just an "empty" List<LookupResult>

Typically this is works because the build() method uses temporary datastructures until it's "build logic" is complete, at which point it atomically replaces the datastructures used by the lookup() method. In the case of AnalyzingInfixSuggester however, the build() method starts by closing & null'ing out the protected SearcherManager searcherMgr (which it only populates again once it's completed building up it's index) and then the lookup method starts with...

    if (searcherMgr == null) {
      throw new IllegalStateException("suggester was not built");
    }

... meaning it is unsafe to call AnalyzingInfixSuggester.lookup() in any situation where another thread may be calling AnalyzingInfixSuggester.build()

Migrated from LUCENE-10292 by Chris M. Hostetter (@hossman), resolved Apr 08 2022 Attachments: LUCENE-10292.patch, LUCENE-10292-1.patch, LUCENE-10292-2.patch, LUCENE-10292-3.patch

asfimport commented 2 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

It seems like at a minimum we should make AnalyzingInfixSuggester.lookup() return an empty List in the searcherMgr == null case – but it also seems like it should be possible/better to change AnalyzingInfixSuggester.build() so that the searcherMgr is only repalced after we build the new index (and/or stop using a new IndexWriter on every AnalyzingInfixSuggester.build() call and just do a writer.deleteAll() instead?)

asfimport commented 2 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

seems like it should be possible/better to change AnalyzingInfixSuggester\.build() so that the searcherMgr is only repalced after we build the new index (and/or stop using a new IndexWriter on every AnalyzingInfixSuggester\.build() call and just do a writer\.deleteAll() instead?)

Digging into the code a bit more i realized:

Using writer.deleteAll() isn't really a viable "fix" because (by default) the writer doesn't exist (or is closed after build()
Another aspect of the existing impl is that if thread T1 enters the lookup() method and makes it past the if (searcherMgr == null) check, but then another thread T2 calls build() before the T1 reaches the synchronized block, then T1 will be blocked for the entire durration of the build() ... which is certainly less then ideal.

I worked up a patch with a test case to try and demonstrate the problem, as well as a fix – note that as written the test case won't always "fail" cleaning with the existing impl, it can also deadlock because of how I'm using a Semaphore to "slow" down the build() – triggering the blocked lookup thread situation described above.

The basic idea of the fix s to decouple the synchronization/locking needed for changing the writer from the synchronization/locking needed to change the SearchManager. and replace the SearchManager only once the build() is complete.

I originally tried to replace the "R/W" locking of SearcherManager with an AtomicReference so we wouldn't need to have any synchronization blocks in lookup() at all; but I couldn't figure out a "safe" way to do that w/o ref counting the SearcherManager itself (to ensure that no build/add calls close the SearcherManager that lookup() gets a ref to before it has a change to acquire() on it (thank you testRandomNRT for catching that for me)

Feedback on this approach (or suggestions for better ways to solve this) welcome

asfimport commented 2 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I originally tried to replace the "R/W" locking of SearcherManager with an AtomicReference so we wouldn't need to have any synchronization blocks in lookup() at all; but I couldn't figure out a "safe" way to do that w/o ref counting the SearcherManager ...

I don't know why it didn't occurto me yesterday, but the obviuos solution to this type of situation is a ReadWriteLock ... patch updated to use a writeLock() when replacing the SearcherManager and a readLock() in lookup() (and getCount()

asfimport commented 2 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I refactored the test code so that it could be applied to all other Lookup impls (that have a build() method) and found that while none of the other impls had the same problem of .lookup() failing to return suggestions during a (re)build, a few FST based Lookups have getCount() impls that return results that inconsistent from .lookup() due to incrementing a count variable gradually during build().

This latest patch (in addition to the expanded testing) fixes those build() methods to update their count value only after replacing the fst in use.

asfimport commented 2 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

It's surprising it took so long for someone to stumble over this. Maybe because most (as I used to do) rebuild their suggesters off line as part of a batch build process? Anyway I looked over the patch and didn't see any problem except a typo in tests "testDurringReBuild" should probably be "testDuringRebuild"? Thanks for the nice tests!

asfimport commented 2 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

... except a typo in tests "testDurringReBuild" should probably be "testDuringRebuild"?

HA! .. yeah, thanks. I renamed them all testLookupsDuringReBuild()

I'll plan to commit to main & backport to 9x tomorrow unless there is any other feedback.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit 5015dc6dbb89a2d3f9c2cd0eb1674f6f146d09b4 in lucene's branch refs/heads/main from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene.git;h=5015dc6dbb8

LUCENE-10292: Suggest: Fix AnalyzingInfixSuggester / BlendedInfixSuggester to correctly return existing lookup() results during concurrent build()

Fix other FST based suggesters so that getCount() returned results consistent with lookup() during concurrent build()

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit b11b0714297ab5f8005aa01001824acc6ef488e7 in lucene's branch refs/heads/branch_9x from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene.git;h=b11b0714297

LUCENE-10292: Suggest: Fix AnalyzingInfixSuggester / BlendedInfixSuggester to correctly return existing lookup() results during concurrent build()

Fix other FST based suggesters so that getCount() returned results consistent with lookup() during concurrent build()

(cherry picked from commit 5015dc6dbb89a2d3f9c2cd0eb1674f6f146d09b4)

asfimport commented 2 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

thanks for the review mike!

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

@hossman - don't know if you saw the recent discussion on the mailing list - how did you arrive at the conclusion that Lookup.build can be called concurrently? I don't think this is mentioned anywhere in Lookup documentation and I don't think the implementation is thread-safe (at least not the TestFreeTextSuggester)?

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I don't see any evidence in implementations of Lookup that build() can be called in a thread-safe manner.

Those testLookupsDuringReBuild are only working by a lucky chance (and rarely still fail!). The code typically releases semaphore permissions quickly here:

    // at every stage of the slow rebuild, we should still be able to get our original suggestions
    for (int i = 0; i < data.size(); i++) {
      initialChecks.check(suggester);
      rebuildGate.release();
    }

while the build() method is not even invoked yet because this line:

                suggester.build(
                    new InputArrayIterator(new DelayedIterator<>(suggester, rebuildGate, data.iterator())));

is semaphore-blocked in the constructor parameters (InputArrayIterator). So the result is that for suggester.build() is typically entered a long time after the check look has finished. It is enough to modify the code to:

    // at every stage of the slow rebuild, we should still be able to get our original suggestions
    for (int i = 0; i < data.size(); i++) {
      rebuildGate.release();
      Thread.sleep(100);
      initialChecks.check(suggester);
    }

to cause repeatable failures (this isn't a suggested fix but a demonstration that the code is currently broken).

asfimport commented 2 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

... how did you arrive at the conclusion that Lookup.build can be called concurrently?

Well, my initial understand/impression was that calling lookup concurrent to a (re)build should be "ok" was based on the fact that it worked for every Suggester I could find except AnalyzingInfixSuggester because all other suggesters atomically replaced their internal structures (typically FSTs) at the end of their build() process – only AnalyzingInfixSuggester blew away it's internal data (it's searcherMgr) at the start of the build method, replacing it only at the end of the build – and had a lookup method that was throw an exception if you tried to use it during a (re)build.

As i said in my initial comment, it seemed like at a minimum we could/should make AnalyzingInfixSuggester return Collections.emptyList() in the case that searcherMgr was null (consistent with the other Lookup impls i found and what their lookup methods returned if they had never been built) but it seemed very easy/possible to make AnalyzingInfixSuggester support lookup concurrent w/ (re)build – so i added it (since there were no objections)

As i mentioned in subsequent comments (above) – while working on the test for this (and refactoring it so that it could be applied to all suggesters) i found that while i couldn't trigger any failures like this in other Lookup impls during concurrent calls to build()/lookup() i did discover some FST based suggesters (like AnalysingSuggester and FSTCompletionLookup) had inconsistencies between the getCount() and lookup() since the "count" variable was being updated iteratively while the fst was being replaced atomicly –which i only noticed because my new test changes were also sanity checking the count to confirm that the "new" data was in fact getting picked up by the time build finished.

I attempted to "improve" those inconsistencies as well, in the two analyzers where i noticed it, by replacing the "count" variable atomically as well .... but I evidently overlooked that FreeTextSuggester has this same code pattern .

(And yes, my "improvement" isn't a perfect "fix" because the fst & count variables are updated independently w/o any synchronization – but it didn't seem worth the overhead of adding that since there's nothing in the docs that implies/guarantees anything about what getCount will return during a build – but it certainly seemed wrong to me that those impls would happily return lots of results from lookup calls while getCount() might return '0')

...while the build() method is not even invoked yet because this line: ... is semaphore-blocked in the constructor parameters (InputArrayIterator).

...ah, yes, i overlooked that.

The spirit of what I was trying to do with this test is assert that the various "checks" all pass even while the build method is in the process of consuming an iterator. I initially implemented the test with both the Semaphore and sleep calls, but didn't want to leave them in and make the test "slow" – and when removing the sleep calls I clearly didn't give enough thought to the possibility that the "main" thread would have a chance to release all it's permits before the background thread had acquired any of them.

I've got an improved version of the test locally that uses bi-directional Semaphores to ensure the checks are tested between every call to next() (and wraps the InputArrayIterator instead of vice-versa so it doesn't get hung up on the behavior of that classes constructor) which reliably triggers the current sporadic jenkins failures in TestFreeTextSuggester – and making the same improvements to how that FreeTextSuggester tracks its "count" that my previous commits made to AnalysingSuggester and FSTCompletionLookup causes the modified test to reliably pass.

I'll commit these changes to main & 9x ASAP to stop the jenkins failures – but if you would like to re-open this issue for further discussion (and/or revert all of these commits in the meantime) I understand

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit a8d86ea6e8b89ea0324e7f8b6e1d5213254274d5 in lucene's branch refs/heads/main from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene.git;h=a8d86ea6e8b

LUCENE-10292: Suggest: Fix FreeTextSuggester so that getCount() returned results consistent with lookup() during concurrent build()

Fix SuggestRebuildTestUtil to reliably surfice this kind of failure that was previously sporadic

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit 65f5a3bcc5afc17888258319d76e4a6293490d12 in lucene's branch refs/heads/branch_9x from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene.git;h=65f5a3bcc5a

LUCENE-10292: Suggest: Fix FreeTextSuggester so that getCount() returned results consistent with lookup() during concurrent build()

Fix SuggestRebuildTestUtil to reliably surfice this kind of failure that was previously sporadic

(cherry picked from commit a8d86ea6e8b89ea0324e7f8b6e1d5213254274d5)

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Thanks Chris. I'm still not sure whether these tests make sense without explicitly stating that build() can be called on Lookup to dynamically (and concurrently) replace its internals... For example, FSTCompletionLookup:

      // The two FSTCompletions share the same automaton.
      this.higherWeightsCompletion = builder.build();
      this.normalCompletion =
          new FSTCompletion(higherWeightsCompletion.getFST(), false, exactMatchFirst);
      this.count = newCount;

none of these fields are volatile or under a monitor, so no guaranteed flush occurs anywhere. I understand eventually they'll get consistent by piggybacking on some other synchronization/ memfence but it's weird to rely on this behavior. I think it'd be a much more user-friendly API if Lookup was actually detached entirely from its build process (for example by replacing the current build method with a builder() that would return a new immutable Lookup instance). This would be less confusing and would also allow for a cleaner implementation (no synchronization at all required - just regular assignments, maybe even with final fields).

I'm not saying this should be implemented here - perhaps it's worth a new issue to do this refactoring.

Separately from the above, if the test fails, it'll leak threads - this:

acquireOnNext.acquireUninterruptibly();

literally blocks forever. It should be replaced with a try/catch that rethrows an unchecked exception when the iterator thread is interrupted.

asfimport commented 2 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I'm still not sure whether these tests make sense without explicitly stating ...

All I was really trying to do with these tests was demonstrate that data you get out of the Lookup before you call build(), can still be gotten from the Lookup while build() is incrementally consuming an iterator (which may take a long time if you are building up from a long iterator) and that this behavior is consistent across Lookup impls (as opposed to before i filed this issue, when most Lookups worked that way, but AnalyzingInfixSuggester would throw an ugly exception – which was certainly confusing to users who might switch from one impl to another).

I didn't set out to make any hard & fast guarantee about the thread safety of all lookups – just improve the one that awas obviously inconsistent with the others (progress, not perfection)

I'm happy to rename/re-document/remove the test code if you'd like - or confine it to just AnalyzingInfixSuggester - but i think (assume?) we can probably agree that the src/main code changes I've made in this issue are generally for the better?

I think it'd be a much more user-friendly API if Lookup was actually detached entirely from its build process (for example by replacing the current build method with a builder() that would return a new immutable Lookup instance). ... I'm not saying this should be implemented here - perhaps it's worth a new issue to do this refactoring.

+1 ... although some of the lookup impls explicitly support update()int individual suggestions – so you'd either have to remove that functionality or refactor the API in some way that it can still be optional.

Separately from the above, if the test fails, it'll leak threads ... It should be replaced with a try/catch that rethrows an unchecked exception

Yeah, that was lazy & sloppy of me – sorry about that. I'll fix it ASAP.

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit 6afb9bc25a5935a3cba0a06e67b27fe290255467 in lucene's branch refs/heads/main from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene.git;h=6afb9bc25a5

LUCENE-10292: prevent thread leak (or test timeout) if exception/assertion failure in test iterator

asfimport commented 2 years ago

ASF subversion and git services (migrated from JIRA)

Commit 115bcd9c66d9723a7a1ea5131aa2f5d16a6867b7 in lucene's branch refs/heads/branch_9x from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene.git;h=115bcd9c66d

LUCENE-10292: prevent thread leak (or test timeout) if exception/assertion failure in test iterator

(cherry picked from commit 6afb9bc25a5935a3cba0a06e67b27fe290255467)

asfimport commented 2 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

> All I was really trying to do with these tests was demonstrate that data you get out of the Lookup before you call build(), can still be gotten from the Lookup while build() is incrementally consuming an iterator (which may take a long time if you are building up from a long iterator) and that this behavior is consistent across Lookup impls (as opposed to before i filed this issue, when most Lookups worked that way, but AnalyzingInfixSuggester would throw an ugly exception – which was certainly confusing to users who might switch from one impl to another).

I guess I am not comfortable with the fact that this test works only by a lucky coincidence and tests the behavior that isn't guaranteed or documented by the Lookup class - this got me confused and I guess it'll confuse people looking at this code after me. It's not a personal stab at you, it's just something that smells fishy around this code in general.

When I was looking at the failure and tried to debug the test, I didn't see the reason why this test was necessary (I looked at the Lookup class documentation). When I understood what the test did, I looked at the implementations and they seemed to be designed with a single-thread model in mind (external synchronization between lookups and rebuilds).

For example, even now, if you had a tight loop in one thread calling lookup on an FSTCompletionLookup and this loop got compiled, then there's nothing preventing the compiler from reading higherWeightsCompletion and normalCompletion fields once and never again (they're regular fields in FSTCompletionLookup), even if you call build there multiple times in between... Is this likely to happen? I don't know. Is this possible? Sure. Maybe I'm oversensitive because I grew up on machines with much less strict cache coherency protocols but code like this makes me itchy.

> I didn't set out to make any hard & fast guarantee about the thread safety of all lookups – just improve the one that awas obviously inconsistent with the others (progress, not perfection)

That's my point. Either we should make the Lookup interface explicitly state that it's safe to call the build method from another thread or we shouldn't really guarantee (or test) this behavior. I don't want you to revert the changes you made but my gut feeling is that lookup implementations should be designed to be single-threaded or at least immutable (one publisher-multiple readers model) as it makes implementing them much easier - no volatiles, synchronization blocks, etc.

Concurrency concerns should be handled by the code that uses Lookups - this code should know whether synchronization or two concurrent instances are required (one doing the lookups, potentially via multiple threads, one rebuilding). Perhaps a change in the API is needed to separate those two phases (build-use) and then the downstream code has to take care of handling/ swapping out Lookup reference where they're used - I don't know, I just state what I think.

asfimport commented 2 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Bulk close for 9.2.0 release

apache / lucene

AnalyzingInfixSuggester thread safety: lookup() fails during (re)build() [LUCENE-10292] #11328