PathwayCommons / cpath2

Biological pathway data integration and access platform (Pathway Commons)
http://www.pathwaycommons.org/pc2/
MIT License

SMPDB: duplicate or dummy sub-pathways #243

Closed IgorRodchenkov closed 6 years ago

IgorRodchenkov commented 8 years ago

@cannin et al., please see/try:

http://beta.pathwaycommons.org/pc2/search?q=name:%22Propanoate%20metabolism%22&type=pathway&datasource=smpdb

50 pathways have the same name. This is just one example. Shall we still import these data into PC2?

IgorRodchenkov commented 8 years ago

Attn: @cannin @emekdemir @ozgunbabur @gbader @armish

<bp:Interaction rdf:ID="Interaction_3ef7debf0a3fc71a964bdd35d6011dc3">
 <bp:displayName rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">SubPathwayInteraction767</bp:displayName>
 <bp:name rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">SubPathwayReaction</bp:name>
 <bp:name rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">SubPathway767Reaction</bp:name>
 <bp:comment rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">REPLACED http://smpdb.ca/pathways/#SubPathwayInteractions/767</bp:comment>
 <bp:dataSource rdf:resource="#smpdb" />
 <bp:participant rdf:resource="#SmallMolecule_1f23eb7807566d005690e5eff016fd8b" />
</bp:Interaction>

That interaction is also used in, e.g.:

<bp:PathwayStep rdf:ID="PathwayStep_744c82ea84ea65b2aad68080fe5d6ff4">
 <bp:comment rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">REPLACED http://smpdb.ca/pathways/#SubPathwayInteractionSteps/SubPathway767</bp:comment>
 <bp:stepProcess rdf:resource="#Interaction_3ef7debf0a3fc71a964bdd35d6011dc3" />
</bp:PathwayStep>

and one (from the original PW000149 data file) has 44 pathway components, which seems to be the "true" pathway definition. Each of those 49 weird (sub-)pathways in fact originally had the same URI, "http://smpdb.ca/pathways/#SubPathways/767", which became 49 different URIs after merging into PC2. (This is done for data consistency and integrity: some providers are known to attach the same URI to different BioPAX objects, even objects of different types, in different input files.)

We had a similar issue (#205) with KEGG pathways, which was solved by merging them based on the presence of a standard KEGG pathway identifier (via UnificationXref). The SMPDB BioPAX also contains standard (MIRIAM) stable pathway IDs, so we could safely merge these pathways in the same way. E.g., all those 50 pathways contain the same UnificationXrefs with IDs SMP00016 and PW000149 (pathwhiz; no idea what that means, and it is not in MIRIAM).
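A minimal sketch of the merge criterion described above: group pathways whose unification xrefs share the same stable ID. It is not the actual cpath2 merger code. The function name, the input shape (pathway URI mapped to a list of `(db, id)` pairs), the placeholder URIs, and the database labels "SMPDB" and "PathWhiz" are all assumptions made for illustration.

```python
from collections import defaultdict


def group_by_unification_id(pathways, db="smpdb"):
    """Group pathway URIs by the ID of their unification xref in `db`.

    `pathways` maps a pathway URI to a list of (db, id) unification xrefs.
    A pathway is grouped only if it has exactly one distinct ID in `db`;
    all other pathways are returned separately and left unmerged.
    """
    groups = defaultdict(list)
    unmatched = []
    for uri, xrefs in pathways.items():
        ids = {xid for xdb, xid in xrefs if xdb.lower() == db}
        if len(ids) == 1:
            groups[ids.pop()].append(uri)
        else:
            unmatched.append(uri)
    return dict(groups), unmatched


# Hypothetical input: the URIs are placeholders, not real PC2 URIs.
example = {
    "pathway_a": [("SMPDB", "SMP00016"), ("PathWhiz", "PW000149")],
    "pathway_b": [("SMPDB", "SMP00016"), ("PathWhiz", "PW000149")],
    "pathway_c": [],  # name-only pathway without xrefs
}
groups, unmatched = group_by_unification_id(example)
```

Each group with more than one URI is a candidate for merging into a single pathway. Pathways without a usable ID stay in `unmatched`, so nothing is merged on name alone.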

(The last question is: why do those 49 pathways contain that weird interaction at all?)

IgorRodchenkov commented 8 years ago

Also, there are many (sub-)pathways that have only a name (no components, no xrefs at all), such as "G-protein signalling cascade" in SMP00327.owl. E.g., this search query returns three hits, all of them empty pathways without xrefs. These were not merged automatically and still hang around.
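One rough way to list such name-only pathways in a single BioPAX file is sketched below, using only the Python standard library. It assumes BioPAX Level 3 RDF/XML with the usual `bp:` namespace and pathways declared as `bp:Pathway` elements. The function name and the sample document are hypothetical and are not taken from the SMPDB files.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces: BioPAX Level 3 and RDF.
BP = "http://www.biopax.org/release/biopax-level3.owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def find_empty_pathways(rdf_xml):
    """Return (uri, name) for each bp:Pathway with no components and no xrefs."""
    root = ET.fromstring(rdf_xml)
    empty = []
    for pw in root.iter(f"{{{BP}}}Pathway"):
        has_components = pw.find(f"{{{BP}}}pathwayComponent") is not None
        has_xrefs = pw.find(f"{{{BP}}}xref") is not None
        if not has_components and not has_xrefs:
            uri = pw.get(f"{{{RDF}}}ID") or pw.get(f"{{{RDF}}}about")
            name = pw.findtext(f"{{{BP}}}displayName") or pw.findtext(f"{{{BP}}}name")
            empty.append((uri, name))
    return empty


# Hypothetical sample document, made up for this sketch.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:bp="http://www.biopax.org/release/biopax-level3.owl#">
  <bp:Pathway rdf:ID="Pathway_full">
    <bp:displayName>Example pathway</bp:displayName>
    <bp:pathwayComponent rdf:resource="#Interaction_1" />
    <bp:xref rdf:resource="#UnificationXref_1" />
  </bp:Pathway>
  <bp:Pathway rdf:ID="Pathway_empty">
    <bp:displayName>Example empty sub-pathway</bp:displayName>
  </bp:Pathway>
</rdf:RDF>"""

empty_pathways = find_empty_pathways(SAMPLE)
```

This only checks direct child properties of each pathway element, which is enough to flag candidates for manual review before deciding whether to drop or merge them.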

IgorRodchenkov commented 8 years ago

Done.

IgorRodchenkov commented 6 years ago

There are still too many simple sub-pathways, often sharing the same name and having only two pathway components, each of which is always a bp:Interaction element (not recommended in BioPAX) with one or two small molecules as participants. Some of these sub-pathway URIs have the xml:base http://identifiers.org/smpdb/, while others use the "http://smpdb.ca/pathways/#" base and have no xrefs (we are unable to normalize and merge these pathways).