Open caufieldjh opened 2 years ago
As of the 20220203 build, here are the edge/node counts:
total_edges: 5122941
total_nodes: 1018616
The count of NamedThings, however, is only 985490, so there are 33,126 nodes without Biolink classes assigned, or at least with some category other than NamedThing. I suspect this is related to the OMIM CURIE warnings described in this issue, but will need to check the merged graph's nodelist to find anything unexpected.
Confirming:
$ grep -v NamedThing merged-kg_nodes.tsv | wc -l
33127
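The one-line difference from 33,126 is consistent with grep also counting the header row, though that is not confirmed here. To see which sources the uncategorized nodes come from, a per-prefix tally is more informative than a raw count. The following is a minimal sketch, assuming the nodelist is a KGX-style TSV with `id` and `category` columns and `|`-delimited categories; the function names are hypothetical and not part of KG-IDG or KGX.

```python
import csv
from collections import Counter
from typing import Iterable, Mapping


def prefixes_missing_category(
    rows: Iterable[Mapping[str, str]], required: str = "NamedThing"
) -> Counter:
    """Count CURIE prefixes of nodes whose categories lack `required`.

    Assumes each row has `id` and `category` keys and that multiple
    categories are joined with `|` (an assumption about the TSV layout).
    """
    counts: Counter = Counter()
    for row in rows:
        categories = (row.get("category") or "").split("|")
        # Compare on the local name so "biolink:NamedThing" and "NamedThing" both match.
        if any(cat.strip().split(":")[-1] == required for cat in categories):
            continue
        node_id = row.get("id") or ""
        prefix = node_id.split(":", 1)[0] if ":" in node_id else "(no prefix)"
        counts[prefix] += 1
    return counts


def count_from_file(path: str) -> Counter:
    """Run the tally over a tab-separated nodelist file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return prefixes_missing_category(reader)
```

For example, `count_from_file("merged-kg_nodes.tsv").most_common(10)` would list the ten prefixes with the most nodes lacking NamedThing.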
Looks like they're all ENSEMBL gene and protein IDs.
Three different sources use those:
~/kg-idg$ grep -rl ENSEMBL data/transformed/
data/transformed/string/string_edges.tsv_nodes.tsv
data/transformed/string/string_edges.tsv_edges.tsv
data/transformed/string/string_nodes.tsv_nodes.tsv
data/transformed/hpa/hpa-data_nodes.tsv
data/transformed/orphanet/orphanet_nodes.tsv
data/transformed/orphanet/orphanet_edges.tsv
HPA is a Koza transform and applies multiple Biolink categories appropriately.
Orphanet is transformed from orphanet.nt and also assigns ENSEMBL nodes to both Gene/Protein and NamedThing.
So that leaves STRING: it lacks NamedThing assignments in the transformed nodelist.
Will need to modify the transform accordingly.
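One way such a change could look is a small helper applied to each node's category field before it is written out. This is only a sketch of the idea, not the fix that was actually applied to the STRING transform; the helper name, the `biolink:NamedThing` spelling, and the `|` delimiter are assumptions.

```python
def ensure_category(
    category_field: str,
    required: str = "biolink:NamedThing",
    delimiter: str = "|",
) -> str:
    """Return the category field with `required` present exactly once.

    Existing categories keep their order; `required` is appended only
    when it is missing, so the function is safe to apply repeatedly.
    """
    categories = [cat for cat in category_field.split(delimiter) if cat]
    if required not in categories:
        categories.append(required)
    return delimiter.join(categories)
```

Appending rather than replacing keeps the more specific category (for example a protein class) alongside the generic one.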
STRING issue was fixed by #74
Describe the bug
During the merge, KG-IDG produces a large number of "node id [id] has no CURIE prefix" and "Invalid predicate CURIE" errors, though it isn't immediately clear which source these are from.
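To narrow down which source the prefix errors come from, one option is to scan each transformed nodelist separately and count IDs that do not look like CURIEs. The sketch below uses a purely syntactic check, which is an assumption and not KGX's own validation logic; the function names and the `id` column name are likewise hypothetical.

```python
import csv
import re
from collections import Counter
from pathlib import Path
from typing import Iterable

# Syntactic approximation of a CURIE: prefix, colon, non-empty reference.
# References starting with "//" are rejected so plain IRIs are not accepted.
CURIE_PATTERN = re.compile(r"^[A-Za-z_][\w.\-]*:(?!//)\S+$")


def has_curie_prefix(identifier: str) -> bool:
    """Return True if `identifier` looks like prefix:reference."""
    return bool(CURIE_PATTERN.match(identifier))


def tally_unprefixed(nodefiles: Iterable[Path]) -> Counter:
    """Count, per file, the node IDs that fail the CURIE check."""
    counts: Counter = Counter()
    for path in nodefiles:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                if not has_curie_prefix(row.get("id") or ""):
                    counts[str(path)] += 1
    return counts
```

Calling `tally_unprefixed(Path("data/transformed").glob("*/*_nodes.tsv"))` would then report a count per source nodelist, which should point at the transform responsible.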
To Reproduce
Most of the node id prefix errors (not shown) appear to be from OMIM, e.g.:
Expected behavior
This will require some forensics to identify:
AND/OR
Is this an expected part of how a KGX merge operates?
Version
8a5a018e33b07acb6b3d5582e4afe176f893d604