JanusGraph / janusgraph

JanusGraph: an open-source, distributed graph database
https://janusgraph.org
Other
5.26k stars 1.16k forks source link

Drop optimization #4426

Closed ntisseyre closed 4 months ago

ntisseyre commented 4 months ago

Created a new MultiVertexQuery drop API that efficiently drops many vertices.

porunov commented 4 months ago

Local benchmark execution:

Benchmark                        (isMultiDrop)  (verticesAmount)  Mode  Cnt    Score    Error  Units
MultiDropBenchmark.dropVertices           true              5000  avgt    5   71.137 ±  2.007  ms/op
MultiDropBenchmark.dropVertices          false              5000  avgt    5  900.047 ± 53.045  ms/op

Conclusion: locally deleting vertices via multiQuery performed >12 times faster than via Gremlin API.

ntisseyre commented 4 months ago

I would prefer adding a simple unit test also which tests the new API tx.multiQuery(vertices).drop() for better test coverage. Otherwise this PR looks good to me. After this PR is merged I think we could also overwrite TinkerPop's DropStep to leverage this new optimization in Gremlin queries without the need to using multuQuery directly (i.e. similar as we did with JanusGraphHasStep as it is also a filter step).

I'm not sure it will work the same way on individual DropStep, since the benefit of multi-query that it constructs a query once and applies to all vertices. For individual DropStep, it constructs a query every time for each individual vertex to fetch relations (system + standard), and that takes so much time.

porunov commented 4 months ago

I would prefer adding a simple unit test also which tests the new API tx.multiQuery(vertices).drop() for better test coverage. Otherwise this PR looks good to me. After this PR is merged I think we could also overwrite TinkerPop's DropStep to leverage this new optimization in Gremlin queries without the need to using multuQuery directly (i.e. similar as we did with JanusGraphHasStep as it is also a filter step).

I'm not sure it will work the same way on individual DropStep, since the benefit of multi-query that it constructs a query once and applies to all vertices. For individual DropStep, it constructs a query every time for each individual vertex to fetch relations (system + standard), and that takes so much time.

Yes, the default DropStep processes vertices one by one, but our multiquariable custom JanusGraph steps which overwrite TinkerPop steps extend the interface with additional methods to "pre-register vertices which will be processed later". Thus, the overwritten version of the DropStep will simply execute multiQuery for a batch of vertices and the vertices which are already processed will be skipped until a new batch is registered. We use this technique for many custom steps (JanusGraphHasStep, JanusGraphPropertiesStep, JanusGraphVertexStep, etc.). After this PR is merged I can implement JanusGraphDropStep as it is quite easy to do if we already have the necessary multiQuery API and I already done that for multiple steps. So, I think it shouldn't be a problem unless I am missing something.