blazegraph / database

Blazegraph High Performance Graph Database
GNU General Public License v2.0

Concurrency / query execution bug: timeout can lead to incorrect results or hung query #201

Open edwinchoi opened 3 years ago


The bug can be reproduced with BigdataConnectionTest.testOrderByQueriesAreInterruptable: if you reduce MAX_QUERY_TIME to 1000 and increase NTRIALS, you will see (1) a failure because hasNext() incorrectly completed by returning false, and (2) the query hanging indefinitely in hasNext(). You will have to increase the JUnit timeout as well.

I presume cancellation of queries is prone to the same issue.

Problem areas:

  1. ICloseableIterator changes the semantics of hasNext() and next() by throwing unchecked exceptions. Any iterator that depends on another ICloseableIterator and eagerly disposes of it needs to know to check for non-fatal exceptions when calling hasNext() and next(). The semantics of close() are fuzzy as well in relation to the Iterator methods. Should hasNext() throw an exception if the iterator has been closed? I would argue that returning false is acceptable when the iterator has been fully consumed, and that an exception should be thrown otherwise. And when should close() throw an exception? Right now, it throws if the query was pre-empted, which should really be indicated while consuming (ramifications in the next section).

    With ICloseableIterator, the consumer is communicating that it is no longer interested in the output and that the resources should be released. If close() throws an exception because one of the resources could not be cleaned up, the object is left in an indeterminate state. The documentation for java.io.Closeable states to "relinquish the underlying resources and to internally mark the Closeable as closed, prior to throwing the IOException."

  2. AbstractChunkedResolverator should not close the buffer in the ChunkConsumerTask.call() method (the Future that is set on the BlockingBuffer wraps the Callable). Doing so creates a race condition in which the consumer of the BlockingBuffer may detect the closure but fail to detect the failed Future, and return false from hasNext() when there was actually an exception. It should also make sure to dispose of all resources: currently, if any close() fails, the remaining resources are never disposed of.
  3. RunningQueryCloseableIterator demonstrates what I described in (1), and its failing close() causes (2) to fail to clean up the BlockingBuffer and its consumer.
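To illustrate the race in (2) without the Blazegraph classes, here is a minimal, self-contained sketch (all names are hypothetical, not Blazegraph APIs): a producer fails mid-stream but still "closes" its buffer, so a consumer that only watches the buffer would report normal exhaustion. Checking the producer's Future before treating closure as exhaustion surfaces the failure instead.

```java
import java.util.concurrent.*;

public class ChunkRaceDemo {
    static final Object POISON = new Object(); // stands in for the buffer's "closed" marker

    public static void main(String[] args) throws Exception {
        BlockingQueue<Object> buffer = new LinkedBlockingQueue<>();
        ExecutorService exec = Executors.newSingleThreadExecutor();

        // Producer fails mid-stream, then closes the buffer anyway.
        Future<?> producer = exec.submit(() -> {
            buffer.add("chunk-1");
            try {
                throw new RuntimeException("simulated query timeout");
            } finally {
                buffer.add(POISON); // buffer looks "closed" even though we failed
            }
        });

        // Consumer: a naive hasNext() would see POISON and return false,
        // silently truncating results. Checking the Future first surfaces it.
        while (true) {
            Object o = buffer.take();
            if (o == POISON) {
                try {
                    producer.get(); // propagate the producer's failure, if any
                    break;          // normal exhaustion
                } catch (ExecutionException e) {
                    System.out.println("detected failure: " + e.getCause().getMessage());
                    break;
                }
            }
            System.out.println("consumed " + o);
        }
        exec.shutdown();
    }
}
```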

The order in which AbstractChunkedResolverator disposes of its resources is also questionable (same order as acquisition, and ownership is unclear). If whatever is consuming it calls close(), then I assume the intent is to signal that it is no longer interested in the elements... resolvedItr is the delegate, while buffer and src are upstream of it. The producer owns its own resources, but once coupled with the consumer, the consumer takes over? Sounds fishy... in any case, it should close its end of the pipe first before disposing of the upstream.
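A sketch of the disposal discipline I have in mind: close the consumer's end of the pipe first, then the upstream resources, and keep going even when one close() fails so nothing is leaked (the field names resolvedItr, buffer, and src come from AbstractChunkedResolverator; everything else here is hypothetical).

```java
import java.util.ArrayList;
import java.util.List;

public class CloseOrderDemo {
    interface Resource { void close() throws Exception; }

    // Close downstream-first, accumulating failures instead of aborting,
    // so a failed close() cannot prevent disposal of the remaining resources.
    static void closeAll(List<Resource> downstreamFirst) {
        Exception first = null;
        for (Resource r : downstreamFirst) {
            try {
                r.close();
            } catch (Exception e) {
                if (first == null) first = e; else first.addSuppressed(e);
            }
        }
        if (first != null) throw new RuntimeException("close failed", first);
    }

    public static void main(String[] args) {
        List<String> closed = new ArrayList<>();
        List<Resource> resources = List.of(
            () -> closed.add("resolvedItr"),   // our end of the pipe goes first
            () -> { closed.add("buffer"); throw new IllegalStateException("interrupted"); },
            () -> closed.add("src")            // upstream still gets closed
        );
        try {
            closeAll(resources);
        } catch (RuntimeException e) {
            System.out.println("close failed: " + e.getCause().getMessage());
        }
        System.out.println("closed order: " + closed);
    }
}
```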

Checking for side effects through explicit methods would be more sensible: ICloseableIterator::hasFailure and ICloseableIterator::getFailure, and/or a state enum representing the coupling state with the producer { INDET, BOUND, COMPLETED, FAILED, CANCELING, CANCELED }. The last two would be consumer-driven states.
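One possible shape for that surface (the interface, its methods, and the state values are the suggestion above, not existing Blazegraph API): the iterator never smuggles failures through hasNext()/next(); the consumer inspects the state explicitly after exhaustion.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class FailureAwareDemo {
    enum State { INDET, BOUND, COMPLETED, FAILED, CANCELING, CANCELED }

    // Hypothetical interface for the hasFailure/getFailure proposal.
    interface FailureAwareIterator<E> extends Iterator<E>, AutoCloseable {
        State state();
        default boolean hasFailure() { return state() == State.FAILED; }
        Throwable getFailure();     // null unless state() == FAILED
        @Override void close();     // idempotent, no checked exception
    }

    // Minimal implementation over a fixed array whose producer "fails" at the end.
    static final class Demo implements FailureAwareIterator<String> {
        private final String[] items = { "a", "b" };
        private int i = 0;
        private State state = State.BOUND;
        private Throwable failure;

        public State state() { return state; }
        public Throwable getFailure() { return failure; }

        public boolean hasNext() {
            if (state == State.FAILED) return false; // no hidden unchecked throws
            if (i < items.length) return true;
            state = State.COMPLETED;
            return false;
        }

        public String next() {
            if (!hasNext()) throw new NoSuchElementException();
            String e = items[i++];
            if (i == items.length) { // simulate abnormal producer termination
                state = State.FAILED;
                failure = new RuntimeException("producer interrupted");
            }
            return e;
        }

        public void close() {
            if (state == State.BOUND) state = State.CANCELED;
        }
    }

    public static void main(String[] args) {
        Demo it = new Demo();
        while (it.hasNext()) System.out.println("got " + it.next());
        System.out.println("state=" + it.state() + " failure=" +
            (it.hasFailure() ? it.getFailure().getMessage() : "none"));
        it.close();
    }
}
```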

Defining the contract, establishing invariants, and capturing as much of that as possible in a base class would make the control flow easier to reason about. So would making any transformers/filters accept a single abstraction for the upstream: one that encapsulates methods to consume items from the producer, to learn when the producer has terminated (normally or abnormally), and for the consumer to ask the producer to stop (without any side effects).
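The upstream abstraction could be as small as the sketch below (every name here is hypothetical): the consumer pulls, observes termination and its cause, and can request a stop that is idempotent and never throws.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicBoolean;

public class UpstreamDemo {
    // Hypothetical single abstraction for transformers/filters to consume.
    interface Upstream<E> {
        E poll() throws InterruptedException;   // null once drained
        boolean isTerminated();
        Throwable terminationCause();           // null for normal termination
        void requestStop();                     // idempotent, never throws
    }

    static final class QueueUpstream implements Upstream<String> {
        private final BlockingQueue<String> q = new LinkedBlockingQueue<>();
        private final AtomicBoolean stop = new AtomicBoolean();
        private volatile boolean terminated;
        private volatile Throwable cause;

        QueueUpstream() {
            new Thread(() -> {
                for (int i = 0; i < 1000 && !stop.get(); i++) q.add("item-" + i);
                terminated = true; // a stop request is not an error: cause stays null
            }).start();
        }

        public String poll() throws InterruptedException {
            return q.poll(1, TimeUnit.SECONDS);
        }
        public boolean isTerminated() { return terminated; }
        public Throwable terminationCause() { return cause; }
        public void requestStop() { stop.set(true); }
    }

    public static void main(String[] args) throws Exception {
        QueueUpstream up = new QueueUpstream();
        System.out.println("first: " + up.poll());
        up.requestStop();                      // consumer loses interest; no exception
        while (!up.isTerminated()) Thread.sleep(1);
        System.out.println("terminated normally: " + (up.terminationCause() == null));
    }
}
```

The key design point is that cancellation is a request, not an exceptional event: the producer records why it stopped, and the consumer asks for that reason explicitly rather than catching it out of hasNext().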