amazon-archives / dynamodb-janusgraph-storage-backend

The Amazon DynamoDB Storage Backend for JanusGraph
Apache License 2.0

Titan DynamoDB does not release all acquired locks on commit (via gremlin) #58

Closed jfstephe closed 7 years ago

jfstephe commented 7 years ago

This issue is explained in detail on StackOverflow. It seems like something isn't correct in the DynamoDB backend (the BerkeleyDB backend looks OK).

The problem results in a lock not being released when it should be, and when subsequent transactions run you get: tx 0x705eafda280e already locked key-column ( 8- 0- 0- 0- 0- 0- 0-128, 80-160) when tx 0x70629e1d56bf tried to lock

When the offending transaction is run, in the logs I can see that 3 locks are acquired when opening the transaction:

titan_server_1 | 120479 [gremlin-server-exec-3] TRACE com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - acquiring lock on ( 8- 0- 0- 0- 0- 0- 0-128, 80-160) at 123552624951495
titan_server_1 | 120489 [gremlin-server-exec-3] TRACE com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - acquiring lock on ( 6-137-160- 48- 46- 48- 46-177, 0) at 123552635424334
titan_server_1 | 120489 [gremlin-server-exec-3] TRACE com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - acquiring lock on ( 6-137-160- 48- 46- 48- 46-178, 0) at 123552635704705

...but only 2 are released when committing it:

titan_server_1 | 120722 [gremlin-server-exec-3] DEBUG com.amazon.titan.diskstorage.dynamodb.DynamoDBStoreTransaction - commit id:0x705eafda280e
titan_server_1 | 120722 [gremlin-server-exec-3] TRACE com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - Expiring ( 6-137-160- 48- 46- 48- 46-177, 0) in tx 0x705eafda280e because of EXPLICIT
titan_server_1 | 120722 [gremlin-server-exec-3] TRACE com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - Expiring ( 6-137-160- 48- 46- 48- 46-178, 0) in tx 0x705eafda280e because of EXPLICIT
titan_server_1 | 120722 [gremlin-server-exec-3] DEBUG org.apache.tinkerpop.gremlin.server.op.AbstractEvalOpProcessor - Preparing to iterate results from - RequestMessage{, requestId=09f27811-dcc3-4e53-a749-22828d34997f, op='eval', processor='', args={gremlin=g.V().hasLabel("databaseMetadata").has("version", "0.0.1").property("version", "0.0.2").next();g.tx().commit();, batchSize=64}} - in thread [gremlin-server-exec-3]
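For reference, the Gremlin embedded in the RequestMessage above boils down to a single property update followed by a commit. A minimal Java sketch of the same steps (the databaseMetadata vertex and its "version" property come from my application's schema, which isn't shown here; g is assumed to be the graph's traversal source):

    // Update the version property on the metadata vertex, then commit.
    g.V().hasLabel("databaseMetadata")
         .has("version", "0.0.1")
         .property("version", "0.0.2")
         .next();
    g.tx().commit();

    // Running the equivalent update again in a fresh transaction is what then
    // fails with "already locked key-column ...", because one of the locks taken
    // by the first transaction is never released.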

Please can someone take a look at this? If you could indicate that you are looking into it that would be good, as I get the impression this project isn't given much attention (apologies if that's not the case).

jfstephe commented 7 years ago

@amcp - Can you confirm that someone is looking into this? Do I need to look into this myself (somewhat difficult)? Please can you let me know so we aren't duplicating efforts :-).

jfstephe commented 7 years ago

OK, I've done a bit more digging. I added some extra logging in AbstractDynamoDBStore.java and I see 2 acquireLock calls coming from:

com.thinkaurelius.titan.diskstorage.BackendTransaction.acquireEdgeLock(BackendTransaction.java:221)

The first one stores the key-column in the tx.put in acquireLock(), the second one doesn't as it's already there. These both correspond to the first 'acquiring lock' log message in the original post for this issue. This means that there's an entry in keyColumnLocalLocks for that keyColumn. All good so far. For the record, the next 2 locks are for:

com.thinkaurelius.titan.diskstorage.BackendTransaction.acquireIndexLock(BackendTransaction.java:245)

but I don't think that's relevant ATM.
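To make my mental model concrete, here's a tiny sketch of how I understand the acquire path (simplified types, my reading of the behaviour rather than the actual backend code): the first call records the key-column against the transaction, and a second call by the same transaction finds the entry already there and adds nothing new.

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import java.util.concurrent.ExecutionException;

    class AcquireSketch {
        // Simplified stand-in for keyColumnLocalLocks: key-column -> owning tx id.
        private final Cache<String, String> keyColumnLocalLocks = CacheBuilder.newBuilder().build();

        void acquireLock(String keyColumn, String txId) throws ExecutionException {
            // get-or-create: the first caller becomes the owner of the key-column
            String owner = keyColumnLocalLocks.get(keyColumn, () -> txId);
            if (!owner.equals(txId)) {
                // a different transaction already holds it -> the error from the logs
                throw new IllegalStateException("tx " + owner + " already locked key-column "
                        + keyColumn + " when tx " + txId + " tried to lock");
            }
            // the real code also records the key-column in the transaction (tx.put) so it
            // can be expired from the cache on commit; a second acquire by the same tx
            // reaches this point and is effectively a no-op
        }
    }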

I also added some logging when 'releaseLock()' is called in the same class. This is where things get weird:

titan_server_1 | 90331 [gremlin-server-exec-3] DEBUG com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - RELEASING LOCK KC: (  0-  0-  0-  0-  0-  0-  0-128, 80-160)
titan_server_1 | 90332 [gremlin-server-exec-3] DEBUG com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - RELEASING LOCK KC: (  6-137-160- 48- 46- 48- 46-177,  0)
titan_server_1 | 90332 [gremlin-server-exec-3] TRACE com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - Expiring (  6-137-160- 48- 46- 48- 46-177,  0) in tx 0x160c512584a70 because of EXPLICIT
titan_server_1 | 90332 [gremlin-server-exec-3] DEBUG com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - RELEASING LOCK KC: (  6-137-160- 48- 46- 48- 46-178,  0)
titan_server_1 | 90332 [gremlin-server-exec-3] TRACE com.amazon.titan.diskstorage.dynamodb.AbstractDynamoDBStore - Expiring (  6-137-160- 48- 46- 48- 46-178,  0) in tx 0x160c512584a70 because of EXPLICIT

Note that all 3 locks should get released, but only the last 2 do. I even added a keyColumnLocalLocks.cleanUp(); after the invalidate() but that didn't do anything (I'm no Java guy :-) ).

For me this is all pointing to this library being the source of the issue. I just can't see why the cache wouldn't get invalidated. I know items don't get removed immediately when invalidated (although I think calling cleanUp() should poke it), but the entry should be removed at some point before the trouble starts.
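For anyone following along, here is a small self-contained Guava example (my own, not the backend's code) showing the behaviour I was relying on: invalidate() removes the entry and fires the removal listener with cause EXPLICIT, which is where the "Expiring ... because of EXPLICIT" trace lines come from, while cleanUp() only performs pending maintenance such as evicting expired entries.

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.RemovalListener;

    public class GuavaInvalidateDemo {
        public static void main(String[] args) {
            // Log removals the same way the store's trace lines do, including the cause.
            RemovalListener<Object, Object> listener = notification ->
                    System.out.println("Expiring " + notification.getKey()
                            + " because of " + notification.getCause());

            Cache<Object, Object> cache = CacheBuilder.newBuilder()
                    .removalListener(listener)
                    .build();

            cache.put("( 8- 0- 0- 0- 0- 0- 0-128, 80-160)", "tx 0x705eafda280e");
            cache.invalidate("( 8- 0- 0- 0- 0- 0- 0-128, 80-160)"); // prints "Expiring ... because of EXPLICIT"
            cache.cleanUp(); // only runs pending maintenance; nothing left to do here
        }
    }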

I had assumed that Google Guava would be solid, but looking at the number of serious-sounding defects reported against it, this could be the issue.

@amcp - can you make any suggestions?

amcp commented 7 years ago

Thank you for the deep dive. As soon as I come up for air, I can take a look at this (probably on the weekend). Meanwhile, let me see if I can get someone else to take a look at it.

amcp commented 7 years ago

@jfstephe can you add a test case in a PR that reproduces this issue please? It will help our investigation.

jfstephe commented 7 years ago

@amcp Thanks for getting back to me. It would be great if you could get someone to look at this. I've had trouble building locally, let alone running the tests (that's down to me + IntelliJ rather than the project). Hopefully much more useful information follows.

I am using Docker with a Dockerfile based on this. After starting it up I run the commands from the SO post and get the errors described. Note that in the SO post a kind soul has already translated my Gremlin console statements into Java, so that may help with test writing. In all honesty the tests aren't particularly clear IMHO; I would expect them to be written against some Titan API spec, but hey ho.

Anyway, I'm getting to the bottom of this! Using a bit of a build hack, I replaced the Guava cache with ConcurrentHashMaps and I still see the same issue - so it wasn't that (relief!).
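The hack amounted to something like the following (illustrative only, not the exact diff I built): the same get-or-create and remove operations, just backed by a plain ConcurrentHashMap so there is no eviction policy or removal listener involved.

    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative stand-in for the Guava cache used in the build hack.
    class PlainMapLocks {
        private final ConcurrentHashMap<String, String> keyColumnLocalLocks = new ConcurrentHashMap<>();

        String getOrCreate(String keyColumn, String txId) {
            return keyColumnLocalLocks.computeIfAbsent(keyColumn, k -> txId); // like cache.get(key, loader)
        }

        void invalidate(String keyColumn) {
            keyColumnLocalLocks.remove(keyColumn); // like cache.invalidate(key); removal is immediate
        }
    }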

Next, I decided to print out the contents of the hash map whenever it is invalidated or queried, and to record how many instances of the AbstractDynamoDBStore class are created and which lock acquisitions/releases are recorded against each of them. I did this with a static counter: the constructor copies the current value into a member variable 'count' and then increments the static counter (a sketch follows the log excerpt below). The results are interesting. I don't know enough about how Titan expects this plugin to work, so I can't comment too strongly, but it seems that the locks are being acquired through one instance and released through another. Perhaps this is a Titan issue after all? I'm really grasping at straws now. If I don't get this resolved soon, we won't be using Titan for our graph DB and will have to work out a different solution for our needs. Anyway, I've attached the log file. If you search for 'JFS' (my initials), those are my log statements. In the run that produced that log file the transaction error line was:

60 JFS tx 0x23f807deedf5 already locked key-column (184- 0- 0- 0- 0- 0- 0-128, 80-160) when tx 0x23f8b82a5baa tried to lock

Search for '60 JFS' - those seem to be the edge locks, and '61 JFS' is the index locks. You can see the 2 lock requests for '60 JFS' from line 23107. Originally I thought this was a single lock request in the original logs, but those logs don't show when a lock request is made for a key that is already locked by the same transaction. These edge locks should (I believe) be removed at line 23449, but only the index locks are removed and nothing happens for the edge lock key?!
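For reference, the counting instrumentation mentioned above was along these lines (simplified, and the names are mine rather than the real ones):

    import java.util.concurrent.atomic.AtomicInteger;

    // Instrumentation sketch: give every store instance a sequence number so each
    // "JFS" log line shows which instance handled the lock acquire/release.
    abstract class InstrumentedStore {
        private static final AtomicInteger INSTANCE_COUNT = new AtomicInteger();
        protected final int count;

        protected InstrumentedStore() {
            count = INSTANCE_COUNT.getAndIncrement();
        }

        protected void logJfs(String message) {
            // produces lines like "60 JFS tx ... already locked key-column ..."
            System.out.println(count + " JFS " + message);
        }
    }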

As you are working on JanusGraph, perhaps you're best positioned to know where the issue lies.

Note: you will see quite a few 'RuntimeException's in the logs. These aren't real - I was using them as a quick and dirty way to get stack traces. I've also attached my version of AbstractDynamoDBStore for reference.

AbstractDynamoDBStore.java.txt logOut.bingo.txt

Finally, if someone is able to look into this it would be of great help to understand when a fix may be available (an indication of 'tomorrow', 'next week', 'next month', 'next year', 'never' would be great - but hopefully closer to the left hand side of the scale ;-) ).

amcp commented 7 years ago

While we investigate this issue, I believe a viable workaround may be available in the form of Titan's built-in locking. Please check out the locking branch and see if you are able to observe the desired behavior. Related to #64
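For anyone trying the workaround, Titan's built-in locking is enabled per property (or index) through the management API. A minimal sketch, assuming the "version" property from the query above and that the locking branch wires the backend up to honour these settings (graph is an open TitanGraph; TitanManagement, PropertyKey, and ConsistencyModifier come from the com.thinkaurelius.titan.core packages):

    // Enable Titan's built-in locking on the contended property (sketch only).
    TitanManagement mgmt = graph.openManagement();
    PropertyKey version = mgmt.getPropertyKey("version");
    mgmt.setConsistency(version, ConsistencyModifier.LOCK); // contended updates on "version" now lock
    mgmt.commit();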

amcp commented 7 years ago

@jfstephe I rebased the locking branch on the tip of 1.0.0.

jfstephe commented 7 years ago

Many thanks! I've been away from this for a while but hope to return in the coming couple of months(!) I'll re-run the tests I have when I get a chance.

amcp commented 7 years ago

I've merged JanusGraph locking compatibility into the JanusGraph branch. I will investigate potential issues with the Guava and JanusGraph/Titan dependencies separately. Let me know if using JanusGraph locks works for you.

amcp commented 7 years ago

I have a repro and will post a PR in a few minutes.

amcp commented 7 years ago

https://github.com/awslabs/dynamodb-titan-storage-backend/pull/148

amcp commented 7 years ago

I found a solution, demonstrated the repro and a fix, so I am closing this issue.