biolink / kgx

KGX is a Python library for exchanging Knowledge Graphs
https://kgx.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Neo4j Sink (neo_sink.py) CACHE flushing may result in information loss to the graph #320

Open RichardBruskiewich opened 3 years ago

RichardBruskiewich commented 3 years ago

The observed node and edge counts in target graphs seem smaller than expected when loading large graphs.

This may relate to the number of node records exceeding CACHE_SIZE (set to 100,000 by default but also programmatically mutable using the keyword arg 'cache_size', if given to the NeoSink constructor). Loss of edges results when edges are written before nodes are written to the output.

A possible solution is to ensure that all nodes encountered in the cache are flushed to the database first, before flushing of any batch of edges.
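A minimal sketch of that idea follows. It is not the actual neo_sink.py code: `FakeGraphStore` and `NodesFirstSink` are hypothetical names, and the in-memory store only imitates the MATCH-then-MERGE behaviour in which an edge with a missing endpoint is silently dropped.

```python
from typing import Dict, List, Tuple


class FakeGraphStore:
    """In-memory stand-in for Neo4j (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.nodes: Dict[str, dict] = {}
        self.edges: List[Tuple[str, str, str]] = []

    def write_nodes(self, batch: List[dict]) -> None:
        for node in batch:
            self.nodes[node["id"]] = node

    def write_edges(self, batch: List[dict]) -> None:
        # Imitates MATCH ... MERGE: an edge whose endpoints are not yet
        # stored matches nothing and is silently skipped.
        for edge in batch:
            if edge["subject"] in self.nodes and edge["object"] in self.nodes:
                self.edges.append(
                    (edge["subject"], edge["predicate"], edge["object"])
                )


class NodesFirstSink:
    """Buffers records and always flushes pending nodes before edges."""

    def __init__(self, store: FakeGraphStore, cache_size: int = 100_000) -> None:
        self.store = store
        self.cache_size = cache_size
        self.node_cache: List[dict] = []
        self.edge_cache: List[dict] = []

    def write_node(self, record: dict) -> None:
        self.node_cache.append(record)
        if len(self.node_cache) >= self.cache_size:
            self._flush_nodes()

    def write_edge(self, record: dict) -> None:
        self.edge_cache.append(record)
        if len(self.edge_cache) >= self.cache_size:
            self._flush_edges()

    def _flush_nodes(self) -> None:
        if self.node_cache:
            self.store.write_nodes(self.node_cache)
            self.node_cache = []

    def _flush_edges(self) -> None:
        # Key step: drain the node cache before any batch of edges is written.
        self._flush_nodes()
        if self.edge_cache:
            self.store.write_edges(self.edge_cache)
            self.edge_cache = []

    def finalize(self) -> None:
        self._flush_edges()


store = FakeGraphStore()
sink = NodesFirstSink(store, cache_size=2)
for node_id in ("X:1", "X:2", "X:3"):
    sink.write_node({"id": node_id})  # X:3 is still cached when edges arrive
sink.write_edge({"subject": "X:1", "predicate": "biolink:related_to", "object": "X:2"})
sink.write_edge({"subject": "X:2", "predicate": "biolink:related_to", "object": "X:3"})
sink.finalize()
print(len(store.nodes), len(store.edges))  # 3 2
```

This only protects edges whose endpoint nodes have already reached the sink. It does not help if a source emits an edge before its nodes.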

Note: more generally, it is likely important that all node records are read from the Transformer Source before any edge records. This was addressed in some places (e.g. tsv_source.py) during the --stream upgrade for KGX, but other Source implementations may still need to be reviewed.
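One way to enforce that ordering for an arbitrary record stream is sketched below. The record shape is an assumption made for illustration (plain dicts, with edges identified by `subject` and `object` keys), and `nodes_before_edges` is a hypothetical helper, not part of KGX.

```python
from typing import Dict, Iterable, Iterator, List


def nodes_before_edges(records: Iterable[Dict]) -> Iterator[Dict]:
    """Yield every node record first, then the edge records in original order."""
    deferred_edges: List[Dict] = []
    for record in records:
        if "subject" in record and "object" in record:
            # Hold edges back until the whole stream has been scanned.
            deferred_edges.append(record)
        else:
            yield record
    yield from deferred_edges


mixed = [
    {"subject": "X:1", "predicate": "biolink:related_to", "object": "X:2"},
    {"id": "X:1"},
    {"id": "X:2"},
]
ordered = list(nodes_before_edges(mixed))
print([r.get("id", "edge") for r in ordered])  # ['X:1', 'X:2', 'edge']
```

The trade-off is memory: all edges are held until the stream ends, which works against streaming very large graphs. A source that reads its node file before its edge file avoids that cost.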

YaphetKG commented 3 years ago

Hi @RichardBruskiewich, we were experiencing similar issues when trying to use KGX to build a graph from smaller sets.

For us, the issue had to do with how the Neo4j Cypher query handled the UNWIND when writing edges.

        UNWIND $edges AS edge
        MATCH (s:`biolink:NamedThing` {id: edge.subject}), (o:`biolink:NamedThing` {id: edge.object})
        MERGE (s)-[r:`biolink:subclass_of`]->(o)
        SET r += edge

This Cypher would overwrite any existing edge between s and o if an edge of that type is already there. But there are instances where we were trying to have multiple edges of the same type between s and o; modifying the Cypher to also match on the edge id seemed to help. The way we generate edge keys in the Graph sink has a similar issue.

        UNWIND $edges AS edge
        MATCH (s:`biolink:NamedThing` {id: edge.subject}), (o:`biolink:NamedThing` {id: edge.object})
        MERGE (s)-[r:`biolink:subclass_of` {id: edge.id }]->(o)
        SET r += edge
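
The difference between the two queries above can be illustrated with a toy model in plain Python. This does not run against Neo4j and is not KGX code: `merge_edges` is a hypothetical helper that keeps one relationship per distinct merge key and lets later properties overwrite earlier ones, roughly what MERGE followed by `SET r += edge` does. The `provided_by` values are made up.

```python
from typing import Dict, List, Sequence, Tuple


def merge_edges(edges: List[Dict], key_fields: Sequence[str]) -> List[Dict]:
    """Keep one relationship per distinct key; later properties overwrite earlier ones."""
    merged: Dict[Tuple, Dict] = {}
    for edge in edges:
        key = tuple(edge[field] for field in key_fields)
        merged.setdefault(key, {}).update(edge)
    return list(merged.values())


edges = [
    {"id": "e1", "subject": "X:1", "predicate": "biolink:subclass_of",
     "object": "X:2", "provided_by": "source_a"},
    {"id": "e2", "subject": "X:1", "predicate": "biolink:subclass_of",
     "object": "X:2", "provided_by": "source_b"},
]

# First query: merge key is (subject, type, object), so the edges collapse.
without_id = merge_edges(edges, ("subject", "predicate", "object"))
# Second query: the edge id is part of the merge key, so both survive.
with_id = merge_edges(edges, ("subject", "predicate", "object", "id"))

print(len(without_id), without_id[0]["provided_by"])  # 1 source_b
print(len(with_id))  # 2
```

In the first case the second edge silently replaces the properties of the first, which is consistent with edge counts in the target graph being lower than expected.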