TranslatorSRI / NodeNormalization

Service that produces Translator compliant nodes given a curie
MIT License

are UniProtKB IDs always mapped to NCBIGene and other Gene IDs? #128

Open colleenXu opened 2 years ago

colleenXu commented 2 years ago

We noticed that CCL4's NCBIGene ID does not retrieve a UniProtKB ID, even though a CCL4 UniProtKB ID does seem to exist. Giving the CCL4 UniProtKB ID as input doesn't retrieve an NCBIGene ID either.

This is a bit confusing to us, since CDK2 will retrieve a UniProtKB ID when given an NCBIGene ID, and vice versa.

So in some cases, UniProtKB IDs are mapped to the gene entities and in other cases they aren't...

cbizon commented 2 years ago

Nodenorm lets you toggle whether genes and proteins are conflated. For instance, this is the un-conflated call for the CDK2 gene. You can see that only gene symbols are brought back. The conflate parameter defaults to True, so unless you turn it off nodenorm will conflate genes and proteins whenever it can.
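
To make the toggle concrete, here is a minimal sketch of how the two request URLs might be built. The base URL and the `get_normalized_nodes` path are assumptions about a public deployment, not something stated in this thread, and `normalized_nodes_url` is a hypothetical helper; only the `conflate` parameter and its default come from the comment above.

```python
from urllib.parse import urlencode

# Assumed endpoint of a public NodeNormalization deployment; adjust as needed.
BASE_URL = "https://nodenormalization-sri.renci.org/get_normalized_nodes"


def normalized_nodes_url(curie, conflate=True, base_url=BASE_URL):
    """Build a query URL for one CURIE (hypothetical helper).

    conflate=True mirrors the service default described above;
    conflate=False asks for the gene-only (un-conflated) result.
    """
    query = urlencode({"curie": curie, "conflate": str(conflate).lower()})
    return f"{base_url}?{query}"


# NCBIGene:6351 is one of the CCL4 gene identifiers mentioned in this thread.
conflated_url = normalized_nodes_url("NCBIGene:6351")
unconflated_url = normalized_nodes_url("NCBIGene:6351", conflate=False)
print(conflated_url)
print(unconflated_url)
```

Fetching either URL with any HTTP client and comparing the returned identifiers should show whether UniProtKB IDs were folded into the gene's result.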

The allowed gene/protein conflations are taken from UniProtKB's NCBIGene mappings. Now, the thing about UniProtKB is that the proteins in Swiss-Prot (sprot) are mostly 1-to-1 with NCBIGene IDs. Adding in TrEMBL turns this into 1 NCBIGene -> many UniProtKB. So far so good.

But there are a few cases (I think in humans it's something like 70) where a UniProtKB ID maps to more than one NCBIGene identifier. CCL4 is one of these, as it maps to both NCBIGene:388372 and NCBIGene:6351. You can read a little about how UniProtKB works here.

So in this case there were, I think, three choices: (1) conflate the single UniProtKB ID with both genes (and their proteins, and their genes, etc.), (2) don't conflate when one protein maps to >1 gene, or (3) try to figure out the "right" match. (1) led to unhappily large and confusing gene/protein clusters. (3) may not even make sense. So we chose (2). Perhaps there is a good way to go back and try (3) again, but I'm not immediately sure what it would be.
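
As a rough illustration of choice (2), the rule can be sketched like this. It is not the actual nodenorm code: `build_conflation_clusters` is a hypothetical function, and the `PROT_*` and `GENE_1` identifiers are made-up placeholders. Only the two CCL4 NCBIGene IDs come from this thread.

```python
from collections import defaultdict


def build_conflation_clusters(protein_to_genes):
    """Conflate a protein with a gene only when it maps to exactly one gene.

    Returns (clusters, skipped): clusters maps a gene to the set of proteins
    conflated with it; skipped maps each protein that hit more than one gene
    to the sorted list of those genes. Proteins with no gene are ignored.
    """
    clusters = defaultdict(set)
    skipped = {}
    for protein, genes in protein_to_genes.items():
        unique_genes = set(genes)
        if len(unique_genes) == 1:
            (gene,) = unique_genes
            clusters[gene].add(protein)  # many proteins per gene is fine
        elif len(unique_genes) > 1:
            skipped[protein] = sorted(unique_genes)  # ambiguous: leave alone
    return dict(clusters), skipped


# Placeholder mappings; PROT_C stands in for the CCL4 protein.
mappings = {
    "UniProtKB:PROT_A": ["NCBIGene:GENE_1"],
    "UniProtKB:PROT_B": ["NCBIGene:GENE_1"],
    "UniProtKB:PROT_C": ["NCBIGene:388372", "NCBIGene:6351"],
}
clusters, skipped = build_conflation_clusters(mappings)
print(clusters)
print(skipped)
```

Under this rule the 1 gene -> many proteins case still conflates, while a protein like the CCL4 one stays out of every gene cluster, which matches the behavior reported at the top of the issue.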

colleenXu commented 2 years ago

I see... I think we're fine with the current situation. We were just checking what was going on. It looks like we hit one of the edge cases (70 out of thousands...).