edeutsch closed this issue 1 year ago
I've seen this a number of times too. I think your idea of identifying a few "mergeable" categories is a good one. named_thing and the like are tricky, but however they are handled, it has to be better than what's currently happening.
Yeah, interesting idea... FYI, it looks like there are 7,744 nodes in KG2c that have both ("gene" or "protein") AND ("disease" or "phenotypic_feature") in their types. (This represents about 2% of all protein/gene nodes and 2% of all disease/phenotypic_feature nodes in KG2c.)
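For anyone who wants to reproduce this kind of count, here is a minimal sketch of the check. It assumes each node's types are available as a plain list of strings; the `nodes` dict, its `NODE:*` identifiers, and the helper name `has_mixed_types` are made up for illustration and are not how KG2c is actually stored or queried.

```python
# Type sets taken from the comment above.
GENE_LIKE = {"gene", "protein"}
DISEASE_LIKE = {"disease", "phenotypic_feature"}


def has_mixed_types(node_types):
    """True if a node carries at least one gene-like AND one disease-like type."""
    types = set(node_types)
    return bool(types & GENE_LIKE) and bool(types & DISEASE_LIKE)


# Hypothetical toy data: node ID -> list of types (not real KG2c content).
nodes = {
    "NODE:1": ["gene", "phenotypic_feature"],
    "NODE:2": ["gene", "protein"],
    "NODE:3": ["disease"],
    "NODE:4": ["protein", "disease", "named_thing"],
}

# IDs of nodes whose types straddle both sets.
mixed = sorted(node_id for node_id, types in nodes.items() if has_mixed_types(types))
print(len(mixed), mixed)
```

Run against the real node table, the length of `mixed` would be the figure quoted above.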
@edeutsch is this still relevant?
@edeutsch - I think we can close this issue now that the synonymizer prohibits merging identifiers with conflicting categories? (Here, conflicting categories are categories that belong to different 'major branches' of the Biolink tree, with some modifications to the BiologicalEntity branch.)
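A rough sketch of what a "different major branches" check could look like is below. This is not the actual NodeSynonymizer code. The `PARENT` map is a deliberately tiny, simplified hierarchy rather than the real Biolink model, and `BRANCH_OVERRIDES` is only a guess at the kind of modification made to the BiologicalEntity branch. Treating NamedThing as never conflicting is also an assumption.

```python
# Simplified, hypothetical child -> parent map (NOT the full Biolink model).
PARENT = {
    "Gene": "BiologicalEntity",
    "Protein": "BiologicalEntity",
    "Disease": "DiseaseOrPhenotypicFeature",
    "PhenotypicFeature": "DiseaseOrPhenotypicFeature",
    "DiseaseOrPhenotypicFeature": "BiologicalEntity",
    "BiologicalEntity": "NamedThing",
    "SmallMolecule": "ChemicalEntity",
    "ChemicalEntity": "NamedThing",
}

# Assumed modification: split BiologicalEntity into finer "major branches",
# so that genes/proteins and diseases/phenotypes do not share a branch.
BRANCH_OVERRIDES = {
    "Gene": "gene_or_protein",
    "Protein": "gene_or_protein",
    "DiseaseOrPhenotypicFeature": "disease_or_phenotype",
}


def major_branch(category):
    """Walk up the tree and return the major branch label, or None if there is none."""
    current = category
    while current is not None:
        if current in BRANCH_OVERRIDES:
            return BRANCH_OVERRIDES[current]
        parent = PARENT.get(current)
        if parent == "NamedThing":
            return current  # direct child of the root names its own branch
        current = parent
    return None  # NamedThing itself, or a category missing from the toy map


def categories_conflict(a, b):
    """Two categories conflict only if both have a major branch and the branches differ."""
    branch_a, branch_b = major_branch(a), major_branch(b)
    if branch_a is None or branch_b is None:
        return False  # assumption: NamedThing/unknown never blocks a merge
    return branch_a != branch_b
```

With this toy tree, `categories_conflict("Gene", "PhenotypicFeature")` is true while `categories_conflict("Gene", "Protein")` is false, which matches the behavior described above.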
agreed! Thanks for redesigning a much better NodeSynonymizer! closing.
One design flaw of the NodeSynonymizer is that it will lump all things with the same name together. Consider this example:
In this case, Acetabular dysplasia the gene (!) gets lumped in with Acetabular dysplasia the phenotypic_feature. Maybe we need to redesign the NodeSynonymizer so that it can hold two different concepts, both named Acetabular dysplasia, that belong to two unmergeable groups. For that, we need the concept of merge groups. phenotypic_feature and disease_or_phenotypic_feature can be merged. Maybe gross_anatomical_structure can be merged with phenotypic_feature, but never with gene. gene and protein can be merged with each other, but never with a phenotypic_feature, etc. named_thing can be merged with... either? Where would I put MEDDRA:10000396 the named_thing: with the phenotypic_feature group, or with the gene or protein group?
Not easy.
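To make the merge-group idea concrete, here is a minimal sketch in which concepts are keyed by (name, merge group) instead of by name alone. The group labels, the `EX:*` identifiers, and the function name are hypothetical. The sketch also sidesteps the named_thing question by parking such nodes in an "unresolved" group, which is only one possible answer.

```python
from collections import defaultdict

# Hypothetical category -> merge group assignment, following the groups proposed above.
MERGE_GROUP = {
    "gene": "gene_or_protein",
    "protein": "gene_or_protein",
    "phenotypic_feature": "disease_or_phenotype",
    "disease_or_phenotypic_feature": "disease_or_phenotype",
}
UNRESOLVED = "unresolved"  # assumption: named_thing and unlisted categories land here


def cluster_by_name_and_group(nodes):
    """Group (curie, name, category) tuples by lowercased name AND merge group."""
    clusters = defaultdict(list)
    for curie, name, category in nodes:
        group = MERGE_GROUP.get(category, UNRESOLVED)
        clusters[(name.lower(), group)].append(curie)
    return dict(clusters)


# EX:* identifiers are placeholders; MEDDRA:10000396 is the named_thing from above.
nodes = [
    ("EX:gene1", "Acetabular dysplasia", "gene"),
    ("EX:phenotype1", "Acetabular dysplasia", "phenotypic_feature"),
    ("MEDDRA:10000396", "Acetabular dysplasia", "named_thing"),
]

clusters = cluster_by_name_and_group(nodes)
for key, curies in clusters.items():
    print(key, curies)
```

The gene and the phenotypic_feature now end up in separate concepts even though they share a name. Which concept the named_thing should join is still the open question.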
https://www.ncbi.nlm.nih.gov/gene/780896