Multiple edges with same edgeid using VertexCentric indices after modifications to MULTI edge

pankajydv commented 7 years ago

The issue reproduces randomly on my production environment (on Titan), and reproduces on my dev environment pretty easily even with the ConsistencyModifier of the edge set to LOCK.

Following is the code, I used to reproduce the issue on my local environment on JanusGraph 0.1.1, Backed by Apache Cassandra 2.1.7

//Edge
label = mgmt.makeEdgeLabel(HIE_CHILD).multiplicity(Multiplicity.MULTI).make();
mgmt.setConsistency(label, ConsistencyModifier.LOCK);

//VertexCentric Index
mgmt.buildEdgeIndex(label, "HSCCERDecr", Direction.BOTH, Order.decr,
    mgmt.getPropertyKey(HOSTID_E),
    mgmt.getPropertyKey(STATUS_E),
    mgmt.getPropertyKey(CMSTYPE_E),
    mgmt.getPropertyKey(HRANK));

The problem is If I try to concurrently modify an edge with n threads, I get n copies of the same edge, with same edgeid and then gremlin queries start behaving very abruptly. Like if the direct children on some vertex are m, and if I fire a query which uses the VertexCentricIndex the number of children returned are m+n (n is the number of threads I used for concurrent modifications). I used the following code to reproduce the issue.

JanusGraphTransaction trxn = graph.newTransaction();
GraphTraversalSource g = trxn.traversal();
Edge edge = (Edge) g.V().has("msid", parent).outE("hie_child").as("e")
    .has("hostid_e", hostid)
    .inV().has("msid", child)
    .select("e").next();

Long updatedAt = random.nextLong();

edge.property("hrank", random.nextInt());
edge.property("updatedAt_e", updatedAt);

trxn.commit();

Following is the behavior, I observed through gremlin console:

Before the concurrent modification, number of edges with or without the VertexCentric index are same.

gremlin> g.V().has('msid', 1000000).outE('hie_child').as('e').inV().has('msid', 1000002).select('e').count()
==>1
gremlin> g.V().has('msid', 1000000).outE('hie_child').has('hostid_e', 83).as('e').inV().has('msid', 1000002).select('e').count()
==>1

After the concurrent modification with 10 threads. I see 10 edges between the parent/child with VertexCentric index and only 1 wihtout the index.

gremlin> g.V().has('msid', 1000000).outE('hie_child').has('hostid_e', 83).as('e').inV().has('msid', 1000002).select('e').count()
==>10
gremlin> g.V().has('msid', 1000000).outE('hie_child').as('e').inV().has('msid', 1000002).select('e').count()
==>1

If I try to drop only a single edge, it seems as if all the edges are dropped.

gremlin> g.V().has('msid', 1000000).outE('hie_child').has('hostid_e', 83).as('e').inV().has('msid', 1000002).select('e').range(0,1).drop()
gremlin> g.V().has('msid', 1000000).outE('hie_child').has('hostid_e', 83).as('e').inV().has('msid', 1000002).select('e').count()
==>0
gremlin> g.V().has('msid', 1000000).outE('hie_child').as('e').inV().has('msid', 1000002).select('e').count()
==>0

But n-1 edges reapper through the VertexCentric index after the transaction is committed, but without using the index there is no edge.

gremlin> g.tx().commit()
==>null
gremlin> g.V().has('msid', 1000000).outE('hie_child').has('hostid_e', 83).as('e').inV().has('msid', 1000002).select('e').count()
==>9
gremlin> g.V().has('msid', 1000000).outE('hie_child').as('e').inV().has('msid', 1000002).select('e').count()
==>0

Please note within the transaction after deleting the affected edge, it seems all the edges are deleted, but after transaction commit, n-1 edges re-appear using the VertexCentric index query, but there is no edge if that VertexCentric Index is not used.

pankajydv commented 7 years ago

Any updates here, we are suffering from this issue in our Live environment.

pankajydv commented 6 years ago

Any update, on this?

mbrukman commented 6 years ago

@pankajydv — could you please try to see if it's possible to reproduce the same issue with JanusGraph 0.2.0 and/or master branch as well? A number of changes have been made, including upgrades to dependencies.

pankajydv commented 6 years ago

@mbrukman - thanks for the update, let's test this on the latest JanusGraph codebase and we'll let you know the results.

rahul38888 commented 6 years ago

Same issue is reproducible on JanusGraph 0.2.0 with Apache Cassandra 3.0.9 as well. The properties corresponding to are following:

storage.backend=cassandra
storage.cassandra.replication-strategy-class=org.apache.cassandra.locator.NetworkTopologyStrategy
storage.cassandra.replication-strategy-options= dc1, 2
storage.hostname=xxx.xxx.xxx.xxx
storage.cassandra.keyspace=janustest
storage.cassandra.frame-size-mb=300
storage.read-only=false
query.force-index=true

pankajydv commented 6 years ago

@mbrukman - one of my colleague tried this code with JanusGraph 0.2 and Apache Cassandra 3.0.9 and we are still seeing this issue - request you to please prioritize this as this sometimes affect our live production system and is tedious to fix. In our observation what goes wrong is 2 or more copies get indexed in the VertexCentric index having same edge id (if you query in such a way that this VertexCentric index is not used then only 1 edge is found) and starts giving confusing results. This is even after we have set ConsistencyModifier to LOCK for this edge, and it's not possible to change the ConsistencyModifer of the VertexCentric index. Please let us know if you need any more help/information in the form of logs etc from our side.

pankajydv commented 6 years ago

@rahul38888 please attach the code and config you used to reproduce this issue here in zip format.

rahul38888 commented 6 years ago

~~The code used to reproduce the issue is following.~~

This part has been removed because of the long text content. Same code can be found here

robertdale commented 6 years ago

First, this isn't a working example. It would be great if you had something where one could just do mvn test. Next, it's not clear what the expected is vs. actual. Again, if you had a failed test case with mvn test, it would be more clear. Finally, it looks like you're missing a consistency LOCK on edgeIndex "HSCCERDecr"

rahul38888 commented 6 years ago

We have created a working maven test to reproduce the issue. Please find the project attached with this.

Change the graph.properties in classpath accordingly.

testTemp.zip

pankajydv commented 6 years ago

@robertdale We have updated our code in Junit Testcase format, so now the issue is reproducible via mvn test. We tried setting consistency LOCK on edgeindex(VertexCentric index) "HSCCERDecr", but it seems this is not supported by Titan/JanusGraph, as we get the following exception when we try this:

gremlin> edge=mgmt.getRelationType('hie_child')
gremlin> index=mgmt.getRelationIndex(edge, 'HSCCERDecr')
gremlin> mgmt.setConsistency(index, ConsistencyModifier.LOCK)
Cannot change consistency of schema element: HSCCERDecr
Type ':help' or ':h' for help.
Display stack trace? [yN]y
java.lang.IllegalArgumentException: Cannot change consistency of schema element: HSCCERDecr
at org.janusgraph.graphdb.database.management.ManagementSystem.setConsistency(ManagementSystem.java:1115)

amcp commented 6 years ago

Isn’t consistency part immutable? I think you need to set the consistency when you create the index.

rahul38888 commented 6 years ago

Thanks @amcp . I tried to set consistency after building vertex-centric index, as following

RelationTypeIndex hie_childIndex = mgmt.buildEdgeIndex(label, "HSCCERDecr", Direction.BOTH, Order.decr,
                mgmt.getPropertyKey("hostid_e"),
                mgmt.getPropertyKey("status_e"),
                mgmt.getPropertyKey("cmstype_e"),
                mgmt.getPropertyKey("hrank"));

mgmt.setConsistency(hie_childIndex, ConsistencyModifier.LOCK);

This throws following exception:

java.lang.IllegalArgumentException: Cannot change consistency of schema element: HSCCERDecr
at org.janusgraph.graphdb.database.management.ManagementSystem.setConsistency(ManagementSystem.java:1115)

Also, there seems no implementation of JanusGraphManagement#buildEdgeIndex which would take consistency as an argument.

Its seems this is not allowed even when the index is being created.

Please guide me if I am missing something.

rahul38888 commented 6 years ago

The re-runnable code to reproduce the issue is attached below:

janusedgetest.zip

chupman commented 6 years ago

Hi @rahul38888 I'd like to help reproduce the issue. Would it be possible for you to put the code from your zip into a fresh github repository with the steps required to reproduce on a fresh system included in the readme?

pankajydv commented 6 years ago

@chupman I have uploaded this test-case to a fresh git repository. Please let us know if you need more information.

chupman commented 6 years ago

Thanks @pankajydv, I was able to recreate the issue with the repo. I was trying the same thing with the zip yesterday and it was failing before it got to the use case. I forked and uploaded the output xml since 2400 lines seemed a bit long for a Gist.

pankajydv commented 6 years ago

@chupman - are you still working on this. Can you please share, if there is an update.

pluradj commented 6 years ago

It doesn't look like the test code (JanusMultipleEdgesTest and JanusEdgeUpdator) is using multi-threaded transactions as described in the JanusGraph docs.

pankajydv commented 6 years ago

@pluradj - Sorry didn't get your point. Our use case suits a different transaction for every thread, allowing every thread to work independently.

@mbrukman - Request you to please assign this issue to someone so as to expedite the solution to this problem.

pankajydv commented 6 years ago

We tried to figure out the root cause of this problem by looking at the code of tag v0.2.0 and we see that the code in org.janusgraph.graphdb.database.StandardJanusGraph.prepareCommit(...) is not able to acquire edge LOCK through org.janusgraph.graphdb.database.StandardJanusGraph.acquireLock(InternalRelation, int, boolean) for the following operations

//1) Collect deleted edges and their index updates and acquire edge locks & //2) Collect added edges and their index updates and acquire edge locks

We modified the code of the acquireLock method to return true even in case of 'MULTI' edges if the edge consistency level is set to LOCK and could resolve this issue. Can anyone please let's know if this can lead to any discrepancy. Please note we have already tested MULTIPLE edges creation between a pair of vertices and it works fine with this change.

pankajydv commented 6 years ago

Looking at the code we can conclude that an edge with a multiplicity of MULTI will never be able to acquire LOCK. Can someone please throw some light on why is an edge with multiplicity MULTI is not supposed to acquire lock and if it is then how can we make the edge modifications acquire the LOCK. Please note edge consistency is already set to LOCK.

pluradj commented 6 years ago

@pankajydv Do you have a public fork with your changes? It would be easier for understanding what you are proposing, and ultimately could lead to a pull request from you.

pankajydv commented 6 years ago

@pluradj Please refer to my public fork with the changes corresponding to the comment.

chupman commented 6 years ago

HI @pankajydv, I'm curious why you're locking the edge label instead of the property. Based off the docs it seems like it's not recommended to rely on locking this way with Cassandra. Would using the consistency modifier FORK rather than LOCK work for your use case?

pankajydv commented 6 years ago

@chupman

I have tried setting LOCK on edge as well as on all of the edge properties, but the duplicate edges with the same edge-id issue still persist.
The issue doesn't look like a Cassandra level inconsistency issue. I can say that because if I make the aquireLock method to return true - the issue get's resolved. Hence it looks like the issue is because the edge modification path is not acquiring LOCK.
with edge consistency set to FORK, I am still getting multiple edges between these vertices the only difference is this time they all have different ids, this is expected behavior as per the docs. What we need is only a single edge between a pair of vertices depending on the value of an edge property.

graphdb2 commented 6 years ago

Hello, any new on this ? is the only way to handle concurrent update on an edge is to get duplicate edges and filter them out later ?

cjxqhhh commented 4 years ago

Any updates on this? It's an important issue with data consistency I think.

JanusGraph / janusgraph

Multiple edges with same edgeid using VertexCentric indices after modifications to MULTI edge #301