Knowledge-Graph-Hub / kg-covid-19

An instance of KG Hub to produce a knowledge graph for COVID-19 response.
https://github.com/Knowledge-Graph-Hub/kg-covid-19/wiki
BSD 3-Clause "New" or "Revised" License

Improve normalization of proteins #386

Open justaddcoffee opened 3 years ago

justaddcoffee commented 3 years ago

Describe the bug

At least some proteins need normalization: the same entity shows up as several nodes under different identifiers. For example, ACE2:

UniProtKB:Q9BYF1        ACE2    pharmgkb|intact|go-cams
NCBIGene:59272  ACE2    zhou_host_proteins|SciBite-CORD-19
ENSEMBL:ENSG00000130234 ACE2    STRING  # this is the gene, so a separate node arguably is okay (ish)

To Reproduce

$ wget https://kg-hub.berkeleybop.io/kg-covid-19/20210101/kg-covid-19.tar.gz
$ tar xvzf kg-covid-19.tar.gz
$ cut -f1,2,4 merged-kg_nodes.tsv | grep -w -E 'ACE2' | grep -v "^CORD" # ignore CORD-19 papers that mention ACE2 in description
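
To look for other affected names besides ACE2, the same check can be generalized. The following is a minimal sketch, not part of the kg-covid-19 tooling: it assumes its input is the three-column output of the `cut` command above (identifier, name, sources), and the function name `duplicate_names` is hypothetical. Sharing a name is only a heuristic for "same entity", so hits still need manual review.

```python
from collections import defaultdict


def duplicate_names(lines):
    """Group tab-separated (id, name, sources) rows by name.

    Returns only the names that appear under more than one identifier.
    Rows whose identifier starts with "CORD" are skipped, mirroring the
    `grep -v "^CORD"` filter in the shell command.
    """
    ids_by_name = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # malformed row
        curie, name = fields[0], fields[1]
        if not name or curie.startswith("CORD"):
            continue
        if curie not in ids_by_name[name]:
            ids_by_name[name].append(curie)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# The three ACE2 rows reported in this issue.
sample = [
    "UniProtKB:Q9BYF1\tACE2\tpharmgkb|intact|go-cams",
    "NCBIGene:59272\tACE2\tzhou_host_proteins|SciBite-CORD-19",
    "ENSEMBL:ENSG00000130234\tACE2\tSTRING",
]
duplicates = duplicate_names(sample)
for name, ids in duplicates.items():
    print(name, ", ".join(ids))
```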

Expected behavior

Should see something like:

UniProtKB:Q9BYF1 ACE2 pharmgkb|intact|go-cams|zhou_host_proteins|SciBite-CORD-19|STRING

Version

version 20210101

justaddcoffee commented 2 years ago

Per a presentation by @cmungall at the Monarch huddle today, we can improve normalization by doing clique merging with KGX plus an SSSOM mapping file.
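
To make the idea concrete, here is a rough sketch of what clique merging does to the ACE2 nodes above. It is plain Python and does not use KGX's actual API or read a real SSSOM file. The equivalence pairs in `mappings`, the prefix order in `PREFIX_PRIORITY`, and all function names are illustrative assumptions. Nodes connected by a mapping are grouped into one clique with union-find, the member with the highest-priority prefix becomes the leader, and the source lists are concatenated.

```python
from collections import defaultdict

# Assumed leader preference: protein ID first, then gene IDs.
PREFIX_PRIORITY = ["UniProtKB", "NCBIGene", "ENSEMBL"]


def find(parent, x):
    """Return the clique representative of x, halving paths as we go."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Put a and b into the same clique."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def leader_rank(curie):
    """Lower rank wins; unknown prefixes sort last, ties broken by CURIE."""
    prefix = curie.split(":", 1)[0]
    if prefix in PREFIX_PRIORITY:
        return (PREFIX_PRIORITY.index(prefix), curie)
    return (len(PREFIX_PRIORITY), curie)


def clique_merge(nodes, mappings):
    """nodes: {curie: (name, [sources])}; mappings: (subject, object) pairs."""
    parent = {}
    for subject, obj in mappings:
        if subject in nodes and obj in nodes:
            union(parent, subject, obj)
    cliques = defaultdict(list)
    for curie in nodes:
        cliques[find(parent, curie)].append(curie)
    merged = {}
    for members in cliques.values():
        members.sort(key=leader_rank)
        leader = members[0]
        sources = []
        for member in members:
            for source in nodes[member][1]:
                if source not in sources:
                    sources.append(source)
        merged[leader] = (nodes[leader][0], sources)
    return merged


# Nodes as reported in this issue.
nodes = {
    "UniProtKB:Q9BYF1": ("ACE2", ["pharmgkb", "intact", "go-cams"]),
    "NCBIGene:59272": ("ACE2", ["zhou_host_proteins", "SciBite-CORD-19"]),
    "ENSEMBL:ENSG00000130234": ("ACE2", ["STRING"]),
}
# Illustrative equivalence pairs, standing in for rows of an SSSOM file.
mappings = [
    ("UniProtKB:Q9BYF1", "NCBIGene:59272"),
    ("NCBIGene:59272", "ENSEMBL:ENSG00000130234"),
]
merged = clique_merge(nodes, mappings)
for curie, (name, sources) in merged.items():
    print(curie, name, "|".join(sources))
```

With these assumed mappings the three nodes collapse into the single UniProtKB node from "Expected behavior". Whether the ENSEMBL gene should join the protein clique at all is a modeling decision. Leaving the second pair out of `mappings` keeps it as a separate node.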