Closed kgenge closed 7 years ago
What is your definition of handling interrupts gracefully? What behavior do you expect?
@kgenge Please submit a PR with repro/test case that elicits this behavior.
@kgenge Ghost vertices are part of the design of Titan, as it was built with distributed and eventually consistent storage backends in mind. Consistency is at the column (vertex property or edge) level in the multiple item model and at the out-vertex level in the single item model. The recommendations in Titan apply, and we recommend you use Faunus/Titan Hadoop to periodically clean up your graph.
Having said all that, I will abstract the specific inquiry about being gracefully interruptible in parallelMutate into a different issue.
@kgenge Locking fixes I will implement in the very near future might enable you to implement better semantics with a lock store.
@kgenge Depending on the size of your vertices and the average degree of your graph, using the single item model may help a lot, as all the vertex properties get put into the same DynamoDB item.
Hello,
We have encountered an issue with how this dynamodb-titan-storage-backend library handles interrupts to the parrallelMutate method in the "DynamoDBDelegate.java" code. As can be seen in the method (https://github.com/awslabs/dynamodb-titan-storage-backend/blob/1.0.0/src/main/java/com/amazon/titan/diskstorage/dynamodb/DynamoDBDelegate.java#L285), interrupts to this method by titan are not handled gracefully. As per line 300, "// fail out because titan does not poll this thread for interrupted anywhere".
We are using the Hystrix library (https://github.com/Netflix/Hystrix) to protect our service from overloading. Every Titan transaction runs in a separate Hystrix command in our app. These commands are interrupted on timeout, so if any latency problem occurs, we are failing fast. This interruption may occur in various different moments of a TitanDB transaction. However, when the interrupt occurs, we are getting the following error on the parrallelMutate method (error trace shortened here for brevity, as indicated by "...", full stacktrace can be found at http://pastebin.com/LLzZXkiY):
We did some further investigations and we found out that if that interruption occurs during a commit phase, it may lead to data inconsistency. That data inconsistency may show up as an inconsistency between Titan's composite index and real data (vertex is returned by index, but is not present in database anymore) or by incomplete vertex without all provided properties. In essence, it seems that while out thread is interrupted, the data is still inserted in the DynamoDB table during the operation. However, the rollback fails due to the backend exception caused by the parrallelMutate issue mentioned above.
We created a sample project to reproduce that problem, but to be fair it can be quite hard to reproduce it consistently. With this sample project (https://github.com/marcinczernecki/titan-dynamodb-hystrix-issue), we were able to reproduce it once or twice, but we were running it over and over all day to achieve that. It's a Spring Boot application that creates an index (if it's not present) and runs Hystrix commands that wrap a Titan transaction with single vertex creation. The command has a configurable timeout (persistenceTimeout), so if you choose a proper value for this timeout, it should be interrupted during the commit phase (but it depends on your network latency to DynamoDB). When all commands finish we are checking our data consistency within the DynamoDB table. Most of the time it seems to work OK but sometimes it leaves some inconsistency within the DynamoDB table. It might be that this inconsistency is more likely to occur when it's the first run and tables in DynamoDB doesn't exists yet (but that's not always the case).
When we use Cassandra DB as our TitanDB's backend we do not have such a problem, and we are using same application and Titan's settings there.
While we could potentially increase the persistenceTimeout value to a large value, this is not a suitable option for us. Extending the timeout to a a larger value would be contrary to our main goal which is to help our service stay responsive with Hystrix. Besides, it would likely negatively impact our applications performance which may make it not practical for our use-case.
Can this method ("parrallelMutate" and wherever else applicable) be altered please to handle any interrupts gracefully? As it stands, this issue is even failing the resulting rollback process and is thus causing our DynamoDB table data to be inconsistent/out-of-sync with our application. This can often result in unexpected data within our table, "ghost vertices" in our graphs, etc.
Many thanks,