RichardBruskiewich opened this issue 3 years ago
Hi @RichardBruskiewich, we were experiencing similar issues when trying to use KGX to build a graph from smaller sets.
For us the issue had to do with how the Neo4j Cypher UNWIND was writing edges:
UNWIND $edges AS edge
MATCH (s:`biolink:NamedThing` {id: edge.subject}), (o:`biolink:NamedThing` {id: edge.object})
MERGE (s)-[r:`biolink:subclass_of`]->(o)
SET r += edge
This Cypher would overwrite any existing edge of that type between s and o. But there are instances where we were trying to have multiple edges of the same type between s and o, and modifying the Cypher to also match on edge ids seemed to help. The way we generate edge keys in the graph sink has a similar issue (see the sketch after the query below).
UNWIND $edges AS edge
MATCH (s:`biolink:NamedThing` {id: edge.subject}), (o:`biolink:NamedThing` {id: edge.object})
MERGE (s)-[r:`biolink:subclass_of` {id: edge.id }]->(o)
SET r += edge
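The edge key generation in the graph sink has the same consideration: a key built only from subject, predicate, and object collapses parallel edges of the same type. A minimal sketch of a key that also folds in the edge id (the function name and key format here are illustrative, not the actual KGX implementation):

from typing import Optional

def generate_edge_key(subject: str, predicate: str, obj: str, edge_id: Optional[str] = None) -> str:
    # Illustrative only: including the edge id keeps parallel edges of the
    # same predicate between the same subject and object distinct.
    if edge_id:
        return f"{subject}-{predicate}-{obj}-{edge_id}"
    return f"{subject}-{predicate}-{obj}"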
The observed node and edge counts in target graphs seem smaller than expected for large graphs.
This may relate to the number of node records exceeding CACHE_SIZE (set to 100,000 by default, but configurable via the keyword argument 'cache_size' passed to the NeoSink constructor). Edges are lost when they are written to the output before the nodes they connect.
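As a workaround, the cache size can be raised when constructing the sink so that more node records stay resident before any edges are flushed. A rough sketch; the import path and all constructor arguments other than cache_size are assumptions to be checked against the KGX source:

from kgx.sink import NeoSink   # import path is an assumption

sink = NeoSink(
    uri="bolt://localhost:7687",   # placeholder connection details
    username="neo4j",
    password="neo4j",
    cache_size=1_000_000,          # raise above the 100,000 default
)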
A possible solution is to ensure that all nodes held in the cache are flushed to the database before any batch of edges is flushed.
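A minimal sketch of that ordering, using a hypothetical buffering sink (class and method names are illustrative, not the actual NeoSink internals):

class BufferedSink:
    def __init__(self, cache_size: int = 100_000):
        self.cache_size = cache_size
        self.node_cache = []
        self.edge_cache = []

    def write_node(self, record):
        self.node_cache.append(record)
        if len(self.node_cache) >= self.cache_size:
            self.flush()

    def write_edge(self, record):
        self.edge_cache.append(record)
        if len(self.edge_cache) >= self.cache_size:
            self.flush()

    def flush(self):
        # Always write the buffered nodes first, so that every edge's subject
        # and object already exist when the edge batch is written.
        self._write_nodes(self.node_cache)
        self.node_cache.clear()
        self._write_edges(self.edge_cache)
        self.edge_cache.clear()

    def _write_nodes(self, nodes): ...  # database write, omitted here
    def _write_edges(self, edges): ...  # database write, omitted here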
Note: it would likely be important in general for all node records to be read from the Transformer Source before any edge records. This concern was addressed in some places (e.g. tsv_source.py) during the --stream upgrade for KGX, but perhaps needs to be reviewed for other Source implementations.