RichardBruskiewich opened this issue 3 years ago
Hi @RichardBruskiewich, we were experiencing similar issues when trying to use KGX to build a graph from smaller sets.
For us the issue had to do with how the Neo4j Cypher UNWIND was writing edges:
UNWIND $edges AS edge
MATCH (s:`biolink:NamedThing` {id: edge.subject}), (o:`biolink:NamedThing` {id: edge.object})
MERGE (s)-[r:`biolink:subclass_of`]->(o)
SET r += edge
This Cypher would overwrite any existing edge of that type between s and o. But there are instances where we were trying to have multiple edges of the same type between s and o, and modifying the Cypher to also match on edge ids seemed to help. The way we generate edge keys in the graph sink has a similar issue (see the sketch after the query below).
UNWIND $edges AS edge
MATCH (s:`biolink:NamedThing` {id: edge.subject}), (o:`biolink:NamedThing` {id: edge.object})
MERGE (s)-[r:`biolink:subclass_of` {id: edge.id }]->(o)
SET r += edge
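The edge key generation in the graph sink has the same consideration: a key built only from subject, predicate, and object collapses parallel edges of the same type. A minimal sketch of a key that also folds in the edge id (the function name and key format here are illustrative, not the actual KGX implementation):

from typing import Optional

def generate_edge_key(subject: str, predicate: str, obj: str, edge_id: Optional[str] = None) -> str:
    # Illustrative only: including the edge id keeps parallel edges of the
    # same predicate between the same subject and object distinct.
    if edge_id:
        return f"{subject}-{predicate}-{obj}-{edge_id}"
    return f"{subject}-{predicate}-{obj}"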
The observed node and edge counts in target graphs seem smaller than expected for large graphs.
This may relate to the number of node records exceeding CACHE_SIZE (set to 100,000 by default, but configurable via the keyword argument 'cache_size' passed to the NeoSink constructor). Edges are lost when they are written to the output before the nodes they connect.
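As a workaround, the cache size can be raised when constructing the sink so that more node records stay resident before any edges are flushed. A rough sketch; the import path and all constructor arguments other than cache_size are assumptions to be checked against the KGX source:

from kgx.sink import NeoSink   # import path is an assumption

sink = NeoSink(
    uri="bolt://localhost:7687",   # placeholder connection details
    username="neo4j",
    password="neo4j",
    cache_size=1_000_000,          # raise above the 100,000 default
)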
A possible solution is to ensure that all nodes held in the cache are flushed to the database before any batch of edges is flushed.
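A minimal sketch of that ordering, using a hypothetical buffering sink (class and method names are illustrative, not the actual NeoSink internals):

class BufferedSink:
    def __init__(self, cache_size: int = 100_000):
        self.cache_size = cache_size
        self.node_cache = []
        self.edge_cache = []

    def write_node(self, record):
        self.node_cache.append(record)
        if len(self.node_cache) >= self.cache_size:
            self.flush()

    def write_edge(self, record):
        self.edge_cache.append(record)
        if len(self.edge_cache) >= self.cache_size:
            self.flush()

    def flush(self):
        # Always write the buffered nodes first, so that every edge's subject
        # and object already exist when the edge batch is written.
        self._write_nodes(self.node_cache)
        self.node_cache.clear()
        self._write_edges(self.edge_cache)
        self.edge_cache.clear()

    def _write_nodes(self, nodes): ...  # database write, omitted here
    def _write_edges(self, edges): ...  # database write, omitted here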
Note: it would likely be important in general for all node records to be read from the Transformer Source before any edge records. This concern was addressed in some places (e.g. tsv_source.py) during the --stream upgrade for KGX, but perhaps needs to be reviewed for other Source implementations.