Knowledge-Graph-Hub / kg-covid-19

An instance of KG Hub to produce a knowledge graph for COVID-19 response.
https://github.com/Knowledge-Graph-Hub/kg-covid-19/wiki
BSD 3-Clause "New" or "Revised" License

ACE2 interaction present in nt but missing in tsv (20201001 release) #375

Closed realmarcin closed 3 years ago

realmarcin commented 3 years ago

Describe the bug

A triple for interacts_with between ACE2 and GLP1R is present in the nt file but not in the tsv file for the 20201001 release.

To Reproduce

This triple: P43220 interacts_with Q9BYF1

is present in the .nt file from 20201001: https://kg-hub.berkeleybop.io/kg-covid-19/20201001/kg-covid-19.nt.gz

but not in the merged TSV for the release:

(venv) [marcin@n0001 20201001]$ grep ENSP00000389326 merged-kg_edges.tsv | grep ENSP00000362353
(venv) [marcin@n0001 20201001]$

This interaction is present in the transformed STRING TSVs:

(venv) [marcin@n0001 STRING_diff]$ grep ENSP00000389326 20201001_edges.tsv | grep ENSP00000362353
ENSEMBL:ENSP00000362353 biolink:interacts_with ENSEMBL:ENSP00000389326 RO:0002434 STRING biolink:Association 157 0 0 0 0 0 0 0 0 0 0 0 108 94
ENSEMBL:ENSP00000389326 biolink:interacts_with ENSEMBL:ENSP00000362353 RO:0002434 STRING biolink:Association 157 0 0 0 0 0 0 0 0 0 0 0 108 94
(venv) [marcin@n0001 STRING_diff]$ grep ENSP00000389326 20201101_edges.tsv | grep ENSP00000362353
ENSEMBL:ENSP00000362353 biolink:interacts_with ENSEMBL:ENSP00000389326 RO:0002434 STRING biolink:Association 157 0 0 0 0 0 0 0 0 0 0 0 108 94
ENSEMBL:ENSP00000389326 biolink:interacts_with ENSEMBL:ENSP00000362353 RO:0002434 STRING biolink:Association 157 0 0 0 0 0 0 0 0 0 0 0 108 94

(In fact, the 20201001_edges.tsv is identical to 20201101_edges.tsv).

Note that this interaction is absent in both the nt and tsv from 20201101.
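
The chained greps above can also be written as a small script that checks for the edge in either direction. This is only a minimal sketch: it assumes the edge TSV has a header row with columns named `subject` and `object` (the column names are an assumption and can be overridden), and the `EX:other` row in the sample is made up for illustration.

```python
import csv
import io

def find_edges(tsv_file, id_a, id_b, subject_col="subject", object_col="object"):
    """Return every row that links id_a and id_b, in either direction."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    wanted = {id_a, id_b}
    return [row for row in reader if {row[subject_col], row[object_col]} == wanted]

# Tiny in-memory stand-in for an edges file; column names are assumed.
sample = io.StringIO(
    "subject\tpredicate\tobject\n"
    "ENSEMBL:ENSP00000362353\tbiolink:interacts_with\tENSEMBL:ENSP00000389326\n"
    "ENSEMBL:ENSP00000389326\tbiolink:interacts_with\tENSEMBL:ENSP00000362353\n"
    "ENSEMBL:ENSP00000362353\tbiolink:interacts_with\tEX:other\n"  # hypothetical row
)
hits = find_edges(sample, "ENSEMBL:ENSP00000389326", "ENSEMBL:ENSP00000362353")
print(len(hits))
```

To run it against a real file, pass an open file handle instead of the `io.StringIO` sample, e.g. `open("merged-kg_edges.tsv", newline="")`.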

The metadata from the 20201001 nt file suggests that STRING is the source and that this interaction is from text mining:

. . . . "STRING" . "biolink:Association"^^ . "157.0"^^ . "0.0"^^ . "0.0"^^ . "0.0"^^ . "0.0"^^ . "0.0"^^ . "0.0"^^ . "0.0"^^ . "0.0"^^ . "0.0"^^ . "0.0"^^ . "0.0"^^ . "108.0"^^ . "94.0"^^ .

Expected behavior

That the nt and tsv semantically mirror each other.

Version

20201001 release

Additional context

Discovered by Tomas Kliegr and group by rule mining on different releases.

cmungall commented 3 years ago

Might it be the case that tsv vs rdf is a red herring here?

You are comparing an individual transformed source file with the merged file. It seems more likely something is happening in the merge step, which may be intentional, e.g. clique merge
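
To illustrate why a merge step could make a grep on the original identifiers come back empty, here is a toy sketch of clique merging. It is not the KGX implementation: `clique_merge`, the `same_as` mapping, and all `EX:` identifiers are hypothetical, and the actual equivalences used in the release are not shown in this issue.

```python
def clique_merge(edges, same_as):
    """Rewrite each edge endpoint to its clique leader and drop duplicate edges."""
    merged = set()
    for subj, pred, obj in edges:
        merged.add((same_as.get(subj, subj), pred, same_as.get(obj, obj)))
    return sorted(merged)

# Hypothetical equivalences: each source ID maps to a chosen "leader" ID.
same_as = {"EX:ensembl_1": "EX:uniprot_1", "EX:ensembl_2": "EX:uniprot_2"}

# Edges as they might look in a single transformed source file.
edges = [
    ("EX:ensembl_1", "biolink:interacts_with", "EX:ensembl_2"),
    ("EX:ensembl_2", "biolink:interacts_with", "EX:ensembl_1"),
]

merged = clique_merge(edges, same_as)
# After merging, searching for the original ID finds nothing,
# even though the interaction itself is still in the graph.
still_findable = any("EX:ensembl_1" in (s, o) for s, _, o in merged)
print(merged, still_findable)
```

If something like this is happening, the edge would still exist in the merged file but under the leader identifiers, so the comparison would need to search for those instead.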

realmarcin commented 3 years ago

it is possible indeed -- in fact I am going to close this ticket and shift everything to the other one.

https://github.com/Knowledge-Graph-Hub/kg-covid-19/issues/376

realmarcin commented 3 years ago

reopening with more info, still .nt vs .tsv difference -- I think both should be products of the same clique merging etc?