TranslatorSRI / NodeNormalization

Service that produces Translator compliant nodes given a curie
MIT License

Bulk download of SRI NN data only contains Genes and Proteins? #185

Closed amykglen closed 1 year ago

amykglen commented 1 year ago

thanks for making the bulk download at the end of March!

I've begun using the files @gaurav pointed me to, here: https://stars.renci.org/var/babel_outputs/2022dec2-2/kgx/

I noticed that the nodes file (KGX_NN_data-2023mar22_nodes.jsonl.gz) seems to contain only Gene and Protein nodes.

these are the node counts I'm getting by category:

{
  "Gene": 51306192,
  "Protein": 250893067
}

and by prefix:

{
  "NCBIGene": 41177208,
  "ENSEMBL": 29643120,
  "ZFIN": 38076,
  "FB": 30263,
  "WormBase": 81220,
  "HGNC": 43530,
  "OMIM": 16968,
  "UMLS": 207988,
  "RGD": 66943,
  "MGI": 79841,
  "SGD": 7154,
  "dictyBase": 13893,
  "UniProtKB": 230567075,
  "PR": 225980
}
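For anyone who wants to reproduce tallies like the ones above, here is a minimal sketch of one way to compute them. It is not necessarily how these numbers were produced. It assumes each line of the gzipped JSONL file is a JSON object with an `id` field holding a CURIE and a `category` field holding either a string or a list of strings, possibly with a `biolink:` prefix. The function name `count_nodes` is hypothetical.

```python
import gzip
import json
from collections import Counter


def count_nodes(path):
    """Tally nodes by category and by CURIE prefix in a gzipped JSONL file.

    Assumed record shape (not confirmed by the thread):
    {"id": "NCBIGene:1017", "category": ["biolink:Gene"], ...}
    """
    by_category = Counter()
    by_prefix = Counter()
    # Stream the file line by line so it never has to fit in memory.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            node = json.loads(line)
            categories = node.get("category", [])
            if isinstance(categories, str):
                categories = [categories]
            # A node listing several categories is counted once per category.
            for category in categories:
                # "biolink:Gene" -> "Gene"; names without a prefix are kept as-is.
                by_category[category.split(":", 1)[-1]] += 1
            # The CURIE prefix is everything before the first colon.
            by_prefix[node["id"].split(":", 1)[0]] += 1
    return by_category, by_prefix
```

Calling `count_nodes("KGX_NN_data-2023mar22_nodes.jsonl.gz")` and passing each returned counter to `json.dumps(..., indent=2)` would print output in the same shape as the two blocks above. If the by-category totals and by-prefix totals differ, some nodes carry more than one category or none at all.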

I was under the impression that this download should include all nodes the SRI NN knows about, including Diseases, Drugs, etc.?

thanks!

gaurav commented 1 year ago

Ack! Thanks for pointing this out, Amy -- it turns out that I used the default config file, which only uses Gene.txt and Protein.txt.

https://github.com/TranslatorSRI/NodeNormalization/blob/baa1167461697c1a7b03bf65c2cbc7c55c0920d1/config.json#L7

I will rerun the KGX generation with the full compendium over the weekend and have the updated KGX files for you on Monday. Sorry about this!

amykglen commented 1 year ago

ah, got it. no worries! that sounds great. thanks!

gaurav commented 1 year ago

@amykglen Could you please try the KGX dumps at https://stars.renci.org/var/babel_outputs/2022dec2-2/kgx/ and see if that fixes this bug?

amykglen commented 1 year ago

looks good to me - thank you!