Knowledge-Graph-Hub / kg-idg

A Knowledge Graph to Illuminate the Druggable Genome
https://knowledge-graph-hub.github.io/kg-idg/
BSD 3-Clause "New" or "Revised" License

CURIE prefix and predicate errors #41

Open caufieldjh opened 2 years ago

caufieldjh commented 2 years ago

Describe the bug

During the merge, KG-IDG produces a large number of "node id [id] has no CURIE prefix" warnings and "Invalid predicate CURIE" errors, and it isn't immediately clear which source each of them comes from.

To Reproduce

$ python3 run.py download
$ python3 run.py transform
$ python3 run.py merge 2> merge_out.log
$ sort merge_out.log | uniq -c | sort -n
...
      3 Warning: node id http://omim.org/entry/606689 has no CURIE prefix
      3 Warning: node id http://omim.org/entry/613364 has no CURIE prefix
      3 Warning: node id http://omim.org/entry/615383 has no CURIE prefix
      3 Warning: node id http://omim.org/entry/617704 has no CURIE prefix
      5 Invalid  predicate CURIE 'owl:versionIRI'? Ignoring...
     10 Invalid  predicate CURIE 'biolink:RegulateprocessToProcess'? Ignoring...
     14 Invalid  predicate CURIE ':http://www.w3.org/2004/02/skos/core#narrowMatch'? Ignoring...
     29 Invalid  predicate CURIE ':http://www.w3.org/2004/02/skos/core#broadMatch'? Ignoring...
     48 Invalid  predicate CURIE 'rdfs:isDefinedBy'? Ignoring...
     64 Invalid  predicate CURIE 'rdfs:seeAlso'? Ignoring...
     75 Invalid  predicate CURIE 'biolink:NegativelyRegulateprocessToProcess'? Ignoring...
    160 Invalid  predicate CURIE 'owl:disjointWith'? Ignoring...
  16468 Invalid  predicate CURIE ':http://www.w3.org/2004/02/skos/core#closeMatch'? Ignoring...
  71842 Invalid  predicate CURIE ':http://www.w3.org/2004/02/skos/core#exactMatch'? Ignoring...

Most of the node id prefix errors (not shown) appear to be from OMIM, e.g.:

 2 Warning: node id http://www.omim.org/phenotypicSeries/PS619142 has no CURIE prefix
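For illustration, most of the messages above involve full IRIs where a CURIE was expected. A minimal sketch of contracting them, assuming a hand-written prefix map: the IRI bases are taken from the log, but the CURIE prefixes (`OMIM`, `OMIMPS`, `skos`) and the `contract` helper are assumptions, not what KGX or KG-IDG actually use.

```python
# Hypothetical prefix map. The IRI bases appear in the merge log;
# the CURIE prefixes on the right are assumptions for illustration.
PREFIX_MAP = {
    "http://omim.org/entry/": "OMIM:",
    "http://www.omim.org/phenotypicSeries/PS": "OMIMPS:",
    "http://www.w3.org/2004/02/skos/core#": "skos:",
}


def contract(iri: str) -> str:
    """Return a CURIE if a known IRI base matches, else the input unchanged."""
    # The log shows predicates with a stray leading colon, e.g. ':http://...'
    cleaned = iri.lstrip(":")
    # Try the longest bases first so the most specific one wins.
    for base in sorted(PREFIX_MAP, key=len, reverse=True):
        if cleaned.startswith(base):
            return PREFIX_MAP[base] + cleaned[len(base):]
    return iri


if __name__ == "__main__":
    for example in (
        "http://omim.org/entry/606689",
        "http://www.omim.org/phenotypicSeries/PS619142",
        ":http://www.w3.org/2004/02/skos/core#exactMatch",
    ):
        print(example, "->", contract(example))
```

Anything without a matching base (for example `owl:versionIRI`) is returned as-is, so a pass like this would not hide the predicates that need a separate decision.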

Expected behavior

This will require some forensics to identify:

  1. which invalid CURIE is from which source
  2. whether it matters
  3. if it matters, what the correct CURIE should be
  4. if there isn't a preferred CURIE, what it should look like

AND/OR

Is this an expected part of how a KGX merge operates?
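Step 1 of the list above (which invalid CURIE comes from which source) could be approached by counting predicates per transformed edge file. This is only a sketch: it assumes the edge files live under `data/transformed/`, end in `_edges.tsv`, and have a `predicate` column; the function names are hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def predicates_by_file(root, suffix="_edges.tsv"):
    """Map each edge file under root to a Counter of its predicate values."""
    counts = {}
    for path in sorted(Path(root).rglob("*" + suffix)):
        with open(path, newline="", encoding="utf-8") as handle:
            # Assumption: tab-separated with a header row naming 'predicate'.
            reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            counts[str(path)] = Counter(
                (row.get("predicate") or "") for row in reader
            )
    return counts


def files_with(counts, needle):
    """Return {file: number of edges whose predicate contains needle}."""
    hits = {}
    for path, counter in counts.items():
        total = sum(n for pred, n in counter.items() if needle in pred)
        if total:
            hits[path] = total
    return hits


if __name__ == "__main__":
    per_file = predicates_by_file("data/transformed")
    for name, n in files_with(per_file, "exactMatch").items():
        print(n, name)
```

Matching on a substring such as `exactMatch` rather than the full string sidesteps the question of whether the transformed file stores the predicate as an IRI or a CURIE.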

Version

8a5a018e33b07acb6b3d5582e4afe176f893d604

caufieldjh commented 2 years ago

As of the 20220203 build, here are the edge/node counts:

  total_edges: 5122941
  total_nodes: 1018616

The count of NamedThings, however, is only 985490, so there are 33,126 nodes that have no Biolink class assigned, or that at least have some category other than NamedThing. I suspect this is related to the OMIM CURIE warnings above, but I will need to check the merged graph's nodelist for anything unexpected.

caufieldjh commented 2 years ago

Confirming (the count below is one higher than 33,126, presumably because the header line is counted too):

$ grep -v NamedThing merged-kg_nodes.tsv | wc -l
33127

Looks like they're all ENSEMBL gene and protein IDs.
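A rough way to check which ID prefixes are involved is to tally them for nodes that lack the NamedThing category. This sketch assumes the nodelist has `id` and `category` columns and that multiple categories are joined with `|`; `prefixes_without` is a hypothetical helper.

```python
import csv
from collections import Counter


def prefixes_without(path, category="biolink:NamedThing"):
    """Count ID prefixes of nodes whose category list lacks the given category."""
    tally = Counter()
    with open(path, newline="", encoding="utf-8") as handle:
        # Assumption: tab-separated nodelist with 'id' and 'category' columns.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            cats = (row.get("category") or "").split("|")
            if category not in cats:
                # The prefix is everything before the first colon of the ID.
                prefix = (row.get("id") or "").split(":", 1)[0]
                tally[prefix] += 1
    return tally


if __name__ == "__main__":
    for prefix, n in prefixes_without("merged-kg_nodes.tsv").most_common():
        print(n, prefix)
```

Unlike `grep -v`, this looks only at the category column and consumes the header row, so the header is not counted and a stray "NamedThing" in another column cannot mask a node.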

Three different sources use those:

~/kg-idg$ grep -rl ENSEMBL data/transformed/
data/transformed/string/string_edges.tsv_nodes.tsv
data/transformed/string/string_edges.tsv_edges.tsv
data/transformed/string/string_nodes.tsv_nodes.tsv
data/transformed/hpa/hpa-data_nodes.tsv
data/transformed/orphanet/orphanet_nodes.tsv
data/transformed/orphanet/orphanet_edges.tsv

HPA is a Koza transform and applies multiple Biolink categories appropriately. Orphanet is transformed from orphanet.nt and also assigns its ENSEMBL nodes both Gene/Protein and NamedThing categories. That leaves STRING: its transformed nodelist lacks NamedThing assignments, so the transform will need to be modified accordingly.
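The kind of change described above might look like the following. It is a sketch only, not the actual STRING transform code: it assumes categories are stored as a `|`-delimited string, and `ensure_category` is a hypothetical helper.

```python
def ensure_category(value, required="biolink:NamedThing", sep="|"):
    """Return the category string with `required` present exactly once."""
    # Drop empty entries so an empty input yields just the required category.
    cats = [c for c in value.split(sep) if c]
    if required not in cats:
        # Put the most general category first; the order is an assumption.
        cats.insert(0, required)
    return sep.join(cats)


if __name__ == "__main__":
    print(ensure_category("biolink:Protein"))
    print(ensure_category("biolink:NamedThing|biolink:Gene"))
```

Rows that already carry NamedThing pass through unchanged, so applying this to HPA or Orphanet output would be a no-op.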

caufieldjh commented 2 years ago

The STRING issue was fixed by #74.