RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License
Need to redesign the NodeSynonymizer with class groups? #1217

Closed edeutsch closed 1 year ago

edeutsch commented 3 years ago

One design flaw of the NodeSynonymizer is that it will lump all things with the same name together. Consider this example:

NodeNamesDescriptions_KG1.tsv:HP:0008807       Acetabular dysplasia             phenotypic_feature
NodeNamesDescriptions_KG2.tsv:UMLS:C1328407    Acetabular dysplasia             gross_anatomical_structure
NodeNamesDescriptions_KG2.tsv:HP:0008807       Acetabular dysplasia             phenotypic_feature
NodeNamesDescriptions_KG2.tsv:MEDDRA:10000396  Acetabular dysplasia             named_thing
NodeNamesDescriptions_KG2.tsv:UMLS:C3151603    Acetabular dysplasia (rare)      disease_or_phenotypic_feature
NodeNamesDescriptions_KG2.tsv:OMIM:MTHU028558  Acetabular dysplasia (rare)      phenotypic_feature
NodeNamesDescriptions_KG2.tsv:UMLS:C4228370    Acetabular dysplasia, bilateral  disease_or_phenotypic_feature
NodeNamesDescriptions_KG2.tsv:OMIM:MTHU052219  Acetabular dysplasia, bilateral  phenotypic_feature
NodeNamesDescriptions_KG2.tsv:OMIM:MTHU011155  Acetabular dysplasia             phenotypic_feature
NodeNamesDescriptions_KG2.tsv:UMLS:C4229765    Acetabular dysplasia (1 family)  disease_or_phenotypic_feature
NodeNamesDescriptions_KG2.tsv:OMIM:MTHU050620  Acetabular dysplasia (1 family)  phenotypic_feature
NodeNamesDescriptions_KG2.tsv:NCBIGene:780896  Acetabular dysplasia             gene

In this case, Acetabular dysplasia the gene (!) gets lumped in with Acetabular dysplasia the phenotypic_feature. Maybe we need to redesign the NodeSynonymizer so that it can hold two different concepts, both named Acetabular dysplasia, that belong to two unmergeable groups. For that, we need the concept of merge groups. phenotypic_feature and disease_or_phenotypic_feature can be merged. Maybe gross_anatomical_structure can be merged with phenotypic_feature, but never with gene. gene and protein can be merged with each other, but never with a phenotypic_feature, etc. named_thing can be merged with... either? Where would I put MEDDRA:10000396, the named_thing: with the phenotypic_feature group, or with the gene/protein group?

Not easy.

https://www.ncbi.nlm.nih.gov/gene/780896
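
To make the merge-group idea above concrete, here is a minimal Python sketch. It is not the NodeSynonymizer's code: the `MERGE_GROUPS` table and the `cluster_by_name_and_group` helper are hypothetical names, and the group assignments simply mirror the examples in the comment above. Categories with no obvious home, such as named_thing, are parked in an "unassigned" bucket rather than guessed.

```python
from collections import defaultdict

# Hypothetical merge groups, copied from the examples in the comment above.
# Illustrative only; this is not the NodeSynonymizer's real configuration.
MERGE_GROUPS = {
    "phenotypic_feature": "phenotype",
    "disease_or_phenotypic_feature": "phenotype",
    "gross_anatomical_structure": "phenotype",  # only a "maybe" in the comment
    "gene": "gene_protein",
    "protein": "gene_protein",
}


def cluster_by_name_and_group(rows):
    """Cluster (curie, name, category) rows by (lowercased name, merge group).

    Categories missing from MERGE_GROUPS (e.g. named_thing) land in an
    'unassigned' group instead of being forced into one side.
    """
    clusters = defaultdict(list)
    for curie, name, category in rows:
        group = MERGE_GROUPS.get(category, "unassigned")
        clusters[(name.lower(), group)].append(curie)
    return dict(clusters)


rows = [
    ("HP:0008807", "Acetabular dysplasia", "phenotypic_feature"),
    ("UMLS:C1328407", "Acetabular dysplasia", "gross_anatomical_structure"),
    ("MEDDRA:10000396", "Acetabular dysplasia", "named_thing"),
    ("NCBIGene:780896", "Acetabular dysplasia", "gene"),
]
clusters = cluster_by_name_and_group(rows)
for key, curies in clusters.items():
    print(key, curies)
```

With this keying, the gene and the phenotypic_feature share a name but never share a cluster. What to do with the "unassigned" entries is exactly the open question raised above.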

dkoslicki commented 3 years ago

I've seen this a number of times too. I think your idea of identifying a few "mergeable" categories is a good one. named_thing and the like are tricky, but however they are handled, it has to be better than what's currently happening.

amykglen commented 3 years ago

yeah, interesting idea... FYI, it looks like there are 7,744 nodes in KG2c that have both ("gene" or "protein") AND ("disease" or "phenotypic_feature") in their types. (this represents about 2% of all protein/gene nodes and 2% of all disease/phenotypic feature nodes in KG2c.)
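
For anyone wanting to reproduce that kind of count, a rough sketch of the check follows. It assumes the node types are available as a mapping from node ID to a list of type strings; that mapping, the helper names, and the toy data are all hypothetical and are not the actual KG2c query.

```python
GENE_LIKE = {"gene", "protein"}
DISEASE_LIKE = {"disease", "phenotypic_feature"}


def has_conflicting_types(types):
    """True if a node has a gene/protein type AND a disease/phenotype type."""
    types = set(types)
    return bool(types & GENE_LIKE) and bool(types & DISEASE_LIKE)


def count_conflicts(nodes):
    """Count nodes whose type list mixes the two families."""
    return sum(1 for types in nodes.values() if has_conflicting_types(types))


# Toy data for illustration; the type lists are invented, not real KG2c rows.
toy_nodes = {
    "node:1": ["gene", "phenotypic_feature"],
    "node:2": ["gene", "protein"],
    "node:3": ["disease", "phenotypic_feature"],
}
n_conflicts = count_conflicts(toy_nodes)
print(n_conflicts)
```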

finnagin commented 2 years ago

@edeutsch is this still relevant?

amykglen commented 1 year ago

@edeutsch - I think we can close this issue now that the synonymizer prohibits merging identifiers with conflicting categories? (Here, conflicting categories are categories that belong to different 'major branches' of the Biolink tree, with some modifications to the BiologicalEntity branch.)
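
A rough sketch of what a 'major branch' conflict check could look like is below. Everything in it is an assumption made for illustration: the `PARENT` map is a toy hierarchy loosely shaped like Biolink, not the real tree, and `BRANCH_OVERRIDES` is just one possible way to split up the BiologicalEntity branch, not necessarily what the synonymizer does.

```python
# Toy child -> parent map, loosely shaped like Biolink. NOT the real tree.
PARENT = {
    "biological_entity": "named_thing",
    "chemical_entity": "named_thing",
    "gene": "biological_entity",
    "protein": "biological_entity",
    "disease_or_phenotypic_feature": "biological_entity",
    "disease": "disease_or_phenotypic_feature",
    "phenotypic_feature": "disease_or_phenotypic_feature",
    "small_molecule": "chemical_entity",
}

# Hypothetical modification: carve sub-branches out of biological_entity so
# that genes/proteins and diseases/phenotypes count as separate branches.
BRANCH_OVERRIDES = {
    "gene": "gene_protein",
    "protein": "gene_protein",
    "disease_or_phenotypic_feature": "disease_phenotype",
}


def major_branch(category):
    """Walk up from category; return its branch label, or None for the root."""
    node, top = category, None
    while node in PARENT:
        if node in BRANCH_OVERRIDES:
            return BRANCH_OVERRIDES[node]
        top = node
        node = PARENT[node]
    return top


def categories_conflict(a, b):
    """Two categories conflict if they sit in different major branches.

    The root (named_thing) has no branch and is treated as compatible
    with everything in this sketch.
    """
    branch_a, branch_b = major_branch(a), major_branch(b)
    if branch_a is None or branch_b is None:
        return False
    return branch_a != branch_b


print(categories_conflict("gene", "phenotypic_feature"))
print(categories_conflict("gene", "protein"))
```

Treating named_thing as compatible with everything is a choice made here for simplicity; it sidesteps, rather than answers, the earlier question of where MEDDRA:10000396 belongs.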

edeutsch commented 1 year ago

agreed! Thanks for redesigning a much better NodeSynonymizer! closing.