PathwayCommons / cpath2

Biological pathway data integration and access platform (Pathway Commons)
http://www.pathwaycommons.org/pc2/
MIT License

Wrong PR merging due to id-mapping using only xref/id without db #232

Closed IgorRodchenkov closed 8 years ago

IgorRodchenkov commented 8 years ago

Good news: this is not an issue in the upcoming PC2v8 release (which uses the codebase from the pc2v8 branch).

However, it can be a critical issue if the latest cPath2 code (current master branch, where the Merger and id-mapping were refactored considerably) is used for id-mapping and merging to build a cpath2 db/instance: it breaks the merging of entity references and thus affects graph queries and SIF/GSEA output.

The problem is that PANTHER BioPAX has UnificationXrefs like db:"panther pathway component", id:"P02814". That ID, by chance or not, is the same as UniProt:P02814 (though no UniProt xrefs are attached to the original ProteinReference) and can potentially refer to a different thing. The updated Merger does not use xref.db in id-mapping (if a ProteinReference (PR) was not merged, it also tries to add extra primary uniprot/hgnc xrefs by mapping...).
This is bad unless the PANTHER authors actually make sure that their "panther pathway component" IDs ALWAYS refer to exactly the same reference protein as the corresponding UniProt ID.
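
To illustrate the difference between id-only and db-aware lookup, here is a minimal Python sketch. It is not the cPath2 Merger code: the `Xref` type, the `ID_MAPPING` table, and both lookup functions are hypothetical names made up for this illustration, and the table holds a single assumed entry.

```python
from typing import NamedTuple, Optional


class Xref(NamedTuple):
    """Simplified stand-in for a BioPAX UnificationXref (hypothetical type)."""
    db: str
    id: str


# Hypothetical id-mapping table: (normalized db name, id) -> primary UniProt accession.
ID_MAPPING = {
    ("uniprot", "P02814"): "P02814",
}


def map_by_id_only(xref: Xref) -> Optional[str]:
    """Problematic lookup: ignores xref.db, so any matching id string hits."""
    for (_db, known_id), accession in ID_MAPPING.items():
        if known_id == xref.id:
            return accession
    return None


def map_by_db_and_id(xref: Xref) -> Optional[str]:
    """Safer lookup: both the (normalized) db name and the id must match."""
    return ID_MAPPING.get((xref.db.strip().lower(), xref.id))


panther = Xref("panther pathway component", "P02814")
uniprot = Xref("UniProt", "P02814")

# The id-only lookup treats the PANTHER id as a UniProt accession.
print(map_by_id_only(panther))
# The db-aware lookup maps only the real UniProt xref.
print(map_by_db_and_id(panther), map_by_db_and_id(uniprot))
```

With the id-only lookup, the PANTHER xref resolves to a UniProt accession purely because the id strings coincide; the db-aware lookup leaves it unmapped.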

Example (where, luckily, the ID actually means the same protein in PANTHER Pathway Component and UniProt):

<bp:ProteinReference rdf:ID="ProteinReference_1d5098fe4a0c59fd86f6339862e3ca22">
 <bp:xref rdf:resource="#UnificationXref_panther_pathway_component_P02814" />
 <bp:displayName rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">Alanine racemase</bp:displayName>
 <bp:standardName rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">Alanine racemase</bp:standardName>
 <bp:comment rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">ENTITY_REFERENCE_NOTES=Long Name: Alanine racemase Synonym: Synonym not specified Accession: P02814 </bp:comment>
 <bp:comment rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">ENTITY_REFERENCE_ID_DESC=Alanine biosynthesis.PROTEIN.GENERIC.Alanine racemase</bp:comment>
 <bp:comment rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">ENTITY_REFERENCE_PROTEIN_TYPE=GENERIC</bp:comment>
 <bp:comment rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">REPLACED http://www.pantherdb.org/pathways/biopax/P02724#_Alanine_racemase_PROTEIN</bp:comment>
</bp:ProteinReference>

This is yet to be figured out.