JanusGraph / janusgraph

JanusGraph: an open-source, distributed graph database
https://janusgraph.org
Other
5.28k stars 1.16k forks source link

GraphMLReader cannot interpret id attribute of node element in GraphML as long value #1035

Open XComp opened 6 years ago

XComp commented 6 years ago

I wanted to load GraphML file with given vertex IDs (the set IDs are in the valid JanusGraph format). A node element in this GraphML file looks like this.

<!-- ... -->
<node id="512">
  <data key="foo">bar</data>
</node>
<!-- ... -->

I always get a IllegalArgumentException saying Must provide vertex id. I investigated the case and figured out that there's some wrong (?) behavior happening when org.apache.tinkerpop.gremlin.structure.io.graphml.GraphMLReader.findOrCreate(..) is called while reading the graph and processing a END_ELEMENT for node sub-elements (see org.apache.tinkerpop.gremlin.structure.io.graphml.GraphMLReader.readGraph(..):

                    // ...
                } else if (eventType.equals(XMLEvent.END_ELEMENT)) {
                    final String elementName = reader.getName().getLocalPart();

                    if (elementName.equals(GraphMLTokens.NODE)) {
                        final String currentVertexId = vertexId;
                        final String currentVertexLabel = Optional.ofNullable(vertexLabel).orElse(Vertex.DEFAULT_LABEL);
                        final Object[] propsAsArray = vertexProps.entrySet().stream().flatMap(e -> Stream.of(e.getKey(), e.getValue())).toArray();

                        findOrCreate(currentVertexId, graphToWriteTo, vertexFeatures, cache,
                                true, ElementHelper.upsert(propsAsArray, T.label, currentVertexLabel));
                        // ...

The currentVertexId is a String value in the code above which was extracted from the start element's attribute list (respectively the id attribute) of the currently processed sub-element. findOrCreate expects the vertex ID to be an Object, then. Within findOrCreate the passed vertex features are then used to check whether the passed vertex ID is valid which is never the case since the StandardJanusGraph implementation expects the vertex ID to be a long value.

Is this a conceptual bug in the StandardJanusGraph implementation? Or is this kind of expected behavior?

pluradj commented 6 years ago

Do you have an example GraphML that demonstrates the issue? Are you loading the file with graph.io(graphml()).readGraph("mygraph.xml") or some other way?

The GraphML files from TinkerPop found under the data directory all load fine.

XComp commented 6 years ago

I created a simple example with the tinkerpop-modern.xml found in the janusgraph repository using the following code:

import org.apache.tinkerpop.gremlin.structure.io.IoCore;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;
import java.io.IOException;

public class TinkerpopGraphMLImportTest {
    public static void main(String[] args) throws IOException {
        JanusGraphFactory.Builder configBuilder = JanusGraphFactory.build();
        configBuilder
                .set("gremlin.graph", "org.janusgraph.core.JanusGraphFactory")
                .set("storage.backend", "inmemory")
                .set("graph.set-vertex-id", "true");
        try (JanusGraph graph = configBuilder.open()) {
            System.out.println("[Pre-import]  Node count: " + graph.traversal().V().count().next());
            graph.io(IoCore.graphml()).readGraph("core/src/test/resources/tinkerpop-modern.xml");
            System.out.println("[Post-import] Node count: " + graph.traversal().V().count().next());
        }
    }
}

The problem seems to be the graph.set-vertex-id. If disabled, the import succeeds and the expected number of vertices (6) is printed. If enabled, I get the same IllegalArgumentException being thrown having the following stacktrace:

Exception in thread "main" java.lang.IllegalArgumentException: Must provide vertex id
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:135)
    at org.janusgraph.graphdb.transaction.StandardJanusGraphTx.addVertex(StandardJanusGraphTx.java:515)
    at org.janusgraph.graphdb.tinkerpop.JanusGraphBlueprintsTransaction.addVertex(JanusGraphBlueprintsTransaction.java:123)
    at org.janusgraph.graphdb.tinkerpop.JanusGraphBlueprintsGraph.addVertex(JanusGraphBlueprintsGraph.java:129)
    at org.janusgraph.graphdb.tinkerpop.JanusGraphBlueprintsGraph.addVertex(JanusGraphBlueprintsGraph.java:47)
    at org.apache.tinkerpop.gremlin.structure.io.graphml.GraphMLReader.findOrCreate(GraphMLReader.java:310)
    at org.apache.tinkerpop.gremlin.structure.io.graphml.GraphMLReader.readGraph(GraphMLReader.java:177)
    at org.apache.tinkerpop.gremlin.structure.io.graphml.GraphMLIo.readGraph(GraphMLIo.java:92)
    at TinkerpopGraphMLImportTest.main(TinkerpopGraphMLImportTest.java:16)

This underlines the observed behavior of this issue. Probably, the tests on GraphML loads is not tested on custom vertex IDs?

XComp commented 6 years ago

I also grepped for the XML files in the JanusGraph repository. Non of the XML files seem to be loaded anywhere:

$ git clone git@github.com:JanusGraph/janusgraph.git
$ cd janusgraph
$ grep -R 'tinkerpop-[a-z]*.xml' .
Binary file ./.git/index matches

Is this an indication that the GraphML is not tested at all? Or am I misled here?

pluradj commented 6 years ago

Thanks for the full test case. I wouldn't be surprised if there aren't any test cases in JanusGraph specifically around GraphML loading, with or without the graph.set-vertex-id property.

The GraphML functionality is coming from TinkerPop. The code in GraphMLReader assumes that the vertex id is a String.

You could consider using GraphSON if you want to do vertex id assignment.