Wrong xrefs in ProteinReference

ozgunbabur commented 8 years ago

Found a bug in the xrefs of some ProteinReferences in v8. Look at the PR with ID

http://purl.org/pc2/8/Protein_84cb92e17a9f567f5456a46088aa57e1

It is related to the gene PANK2, and it has the xref [db:HGNC Symbol, id: PANK2], just as expected.

But it also has the xref [db: HGNC, id: HGNC:8598] and the xref [db: HGNC, id: HGNC:19365]. The first one belongs to PANK1 and the second one belongs to PANK3.

That is not new to v8. It existed in v7 too.

IgorRodchenkov commented 8 years ago

Good catch. It's not a trivial thing, and it's unclear how to fix it perfectly...

The Protein (originally from ReconX) has no xrefs, but the corresponding ProteinReference (PR), http://webservice.baderlab.org:48080/get?uri=http://identifiers.org/uniprot/Q9BZ23, has multiple xrefs (these come from each original PR that was mapped to this primary canonical UniProt PR).

And indeed, the PR (Q9BZ23) also has the HGNC:8598 and HGNC:19365 xrefs (they originate from the translated and normalized KEGG data: RelationshipXref_kegg_hsa_hsa00770_translatedHGNC_19365_2, RelationshipXref_kegg_hsa_hsa00770_translatedHGNC_8598_2), which map to different primary UniProt accession numbers (AC); see http://webservice.baderlab.org:48080/idmapping?id=HGNC:8598&id=HGNC:19365&id=HGNC:15894.

When the Merger "decides" to replace an original PR with the one from the warehouse, it also copies all the xrefs from the original PR to the canonical one. So, even though we take care not to replace a PR unless it maps uniquely to one UniProt AC, in the above case we created a mess... If we did not copy the xrefs, we would always lose all the original ones...
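To illustrate the copy step described above, here is a minimal Python sketch (not the actual cPath2 Java code). The `ProteinReference` class, the `replace_with_canonical` function, and the `hypothetical:kegg_pank_pr` URI are made up for illustration; only the xref values and the Q9BZ23 URI come from this thread.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple

Xref = Tuple[str, str]  # (db, id)


@dataclass
class ProteinReference:
    # Hypothetical stand-in for a BioPAX ProteinReference
    uri: str
    xrefs: Set[Xref] = field(default_factory=set)


def replace_with_canonical(original: ProteinReference,
                           canonical: ProteinReference) -> ProteinReference:
    """Mimic the merge step: reuse the warehouse PR and copy the
    original PR's xrefs onto it (no consistency check)."""
    canonical.xrefs |= original.xrefs
    return canonical


# Canonical warehouse PR for PANK2, shared by several data sources
canonical = ProteinReference(
    "http://identifiers.org/uniprot/Q9BZ23",
    {("HGNC Symbol", "PANK2")},
)

# Hypothetical KEGG-derived PR carrying xrefs of PANK1 and PANK3
kegg_pr = ProteinReference(
    "hypothetical:kegg_pank_pr",
    {("HGNC", "HGNC:8598"), ("HGNC", "HGNC:19365")},
)

merged = replace_with_canonical(kegg_pr, canonical)
print(sorted(merged.xrefs))
```

Because the canonical object is shared, every Protein that points to Q9BZ23 now appears to carry the PANK1 and PANK3 xrefs as well.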

More details. What happens is that, e.g., KEGG hsa_00770 has a PR like the following (the corresponding Protein also has these same xrefs, by the way), and it is hard to tell what it means:

- bp:name: PANK4, PANK2, HSS, PANK3, PKAN, HARP, PANK1, C20orf48, NBIA1
- bp:displayName: PANK
- bp:standardName: PANK1, MGC24596, PANK, PANK1a, PANK1b...

I suspect that either the KEGG db's or the KEGGTranslator tool's authors actually meant a generic PR; i.e., there should be a separate memberEntityReference value (a PR) for each PANK*, etc. Then we would not have this issue.

However, when the above PR (#hsa53354_...) passes through the BioPAX Normalizer, it becomes:

- bp:name: NBIA1, HSS, HARP, PKAN, PANK4, ...

The Normalizer does not check whether the unification xrefs there are consistent (point to the same thing); it simply picks the first valid one (sorted by id alphabetically) to use in the new URI: Q6P1K9. That accession, by the way, does not map to any PC2 warehouse primary UniProt AC PR, because Q6P1K9 is an "Unreviewed" UniProt entry; it is not in the Swiss-Prot human dataset that we imported.

So, when it comes to merging KEGG into the PC2 main BioPAX model, the normalized Q6P1K9 PR, using all its unification xrefs, happens to map uniquely to the Q9BZ23 PR in our warehouse and thus gets replaced. But the original xrefs are also copied to the canonical Q9BZ23 PR, which also belongs to other proteins from several data sources (CTD, ReconX, etc.). And that's how we get the issue reported by Ozgun.

A simple fix (in the Merger) would be to skip replacing such PRs (e.g., also check whether id-mapping using all the unification and rel. xrefs results in the same unique UniProt AC). A trickier fix would be to replace such a PR with a generic PR and multiple member PRs (this could be implemented in the KEGG Cleaner), but it would require much refactoring and a separate id-mapper (otherwise, it's impossible to reliably reconstruct a generic PR and its members from the original messy PR).
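The simple fix could look roughly like this Python sketch (again, not the actual Merger code). The `ID_MAPPING` table and the function names are hypothetical. `AC_PANK1` and `AC_PANK3` are placeholders, not real accessions; they only stand for "some other primary UniProt AC".

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical id-mapping table: xref id -> primary UniProt ACs.
# Only Q9BZ23 is taken from this thread; the other values are placeholders.
ID_MAPPING: Dict[str, Set[str]] = {
    "PANK2": {"Q9BZ23"},
    "HGNC:8598": {"AC_PANK1"},   # placeholder AC for PANK1
    "HGNC:19365": {"AC_PANK3"},  # placeholder AC for PANK3
}


def unique_uniprot_ac(xref_ids: Iterable[str],
                      id_mapping: Dict[str, Set[str]]) -> Optional[str]:
    """Return the one UniProt AC that all mappable xrefs agree on,
    or None if nothing maps or the mapping is ambiguous."""
    acs: Set[str] = set()
    for xref_id in xref_ids:
        acs |= id_mapping.get(xref_id, set())
    return next(iter(acs)) if len(acs) == 1 else None


def should_replace(xref_ids: Iterable[str],
                   id_mapping: Dict[str, Set[str]]) -> bool:
    """Replace the original PR with a canonical one only when the
    unification and relationship xrefs map to a single AC."""
    return unique_uniprot_ac(xref_ids, id_mapping) is not None


# A PR whose xrefs all point to PANK2 can be replaced
print(should_replace(["PANK2"], ID_MAPPING))
# The messy KEGG PR mixes PANK1, PANK2 and PANK3, so it is left alone
print(should_replace(["PANK2", "HGNC:8598", "HGNC:19365"], ID_MAPPING))
```

Taking the union over all xrefs, instead of stopping at the first xref that maps, is what keeps a PR with mixed PANK* xrefs from being merged into Q9BZ23.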
PS: I am starting to believe that we'd better never merge anything at all (never merge EntityReferences by URI, Xrefs) but just store the original/normalized data as is. Perhaps just enrich it with more xrefs using id-mapping, so that full-text search and patterns work better... This would be similar to what we actually did in PC1 (pathwaycommons.org/pc/).

IgorRodchenkov commented 8 years ago

Now fixing (in the biopax-validator Normalizer, KeggCleanerImpl, and cPath2 Merger)...

IgorRodchenkov commented 8 years ago

Oops, it looks like I found the bug in the PC2 Merger (in the idMappingByXrefsIntersection method): the true reason why, e.g., the "hsa53354_hsa55229_hsa79646_hsa80025.eref" PR merges into the canonical "Q9BZ23" even though id-mapping by xrefs is ambiguous (it must not merge!). Fixing now...

PathwayCommons / cpath2

Wrong xrefs in ProteinReference #224