PathwayCommons / cpath2

Biological pathway data integration and access platform (Pathway Commons)
http://www.pathwaycommons.org/pc2/
MIT License

SMPDB data issues #263

Closed IgorRodchenkov closed 5 years ago

IgorRodchenkov commented 7 years ago

The new version of the SMPDB BioPAX data (we downloaded the 05-Jun-2016 release of the BioPAX archive and imported it into beta PC9) contains UnificationXrefs (of a Pathway) such as:

<bp:UnificationXref rdf:ID="Reference/SMPDB_SMP00001">
 <bp:id rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">http://identifiers.org/smpdb/SMP00001</bp:id>
 <bp:db rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">SMPDB</bp:db>
</bp:UnificationXref>
<bp:UnificationXref rdf:ID="Reference/SMPDB_SMP00001">
 <bp:id rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">SMP00001</bp:id>
 <bp:db rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">SMPDB</bp:db>
</bp:UnificationXref>

Using a URI as an Xref.id value is a mistake. URIs such as "http://identifiers.org/smpdb/SMP00001" should instead be the URIs of the corresponding Pathway BioPAX objects (recommended; this also helps avoid duplicate pathways when integrating multiple SMPDB BioPAX files into one model).

For example, instead of:

<bp:Pathway rdf:ID="Pathways/PW000185">
...
<bp:pathwayComponent rdf:resource="#SubPathways/1" />

it would be much better to use the following (for pathway definitions and references, wherever official standard URIs and IDs exist):

<bp:Pathway rdf:about="http://identifiers.org/smpdb/SMP00001">
...
<bp:pathwayComponent rdf:resource="http://identifiers.org/smpdb/SMP00055" />

This would make SMPDB BioPAX more useful for everyone (currently, we at Pathway Commons have to make these fixes ourselves, replacing URIs and IDs, in order to integrate and use SMPDB data).

(Let's contact SMPDB authors and also update/fix our data cleaner code.)
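
As a rough illustration of the xref.id cleanup described above, here is a minimal Python sketch. It is not the actual cpath2 data cleaner; the names `normalize_xref_id` and `dedupe_xrefs` are hypothetical, and the URI pattern is assumed from the single example quoted in this issue.

```python
import re

# Assumption: SMPDB pathway URIs look like the example quoted in this issue,
# i.e. http://identifiers.org/smpdb/SMP00001
SMPDB_URI = re.compile(r"^https?://identifiers\.org/smpdb/(SMP\d+)$")


def normalize_xref_id(value):
    """Return the bare SMPDB accession if value is an identifiers.org URI."""
    value = value.strip()
    match = SMPDB_URI.match(value)
    return match.group(1) if match else value


def dedupe_xrefs(xrefs):
    """Normalize (db, id) pairs and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for db, xref_id in xrefs:
        key = (db, normalize_xref_id(xref_id))
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result
```

Applied to the two UnificationXrefs quoted at the top, both (db, id) pairs normalize to ("SMPDB", "SMP00001"), so only one xref would remain.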

@cannin @gbader @emekdemir @ozgunbabur @jvwong

IgorRodchenkov commented 7 years ago

I bet this also makes #243 recur; we need to check and test again (after rebuilding the beta PC9 instance from scratch).

IgorRodchenkov commented 7 years ago

Also, the strings "SubPathway", "SubPathwayOutput", and "SubPathwayInput" should not be values of the BioPAX 'name' property of a Pathway, SmallMolecule, etc. This is another BioPAX hack, probably not very useful and possibly misleading for data analysts. It would be better to put these values in the BioPAX 'comment' property.
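
A minimal sketch of that name-to-comment move, in Python for illustration only. The function `move_markers_to_comments` is hypothetical and works on plain lists of strings rather than on real BioPAX objects.

```python
# Marker strings named in the comment above.
MARKERS = {"SubPathway", "SubPathwayOutput", "SubPathwayInput"}


def move_markers_to_comments(names, comments):
    """Return (names, comments) with marker strings moved from names to comments."""
    kept = [n for n in names if n not in MARKERS]
    new_comments = list(comments)
    for n in names:
        # Append each marker once, skipping any already present in the comments.
        if n in MARKERS and n not in new_comments:
            new_comments.append(n)
    return kept, new_comments
```

Real names are left untouched; only the three marker strings are moved, and the input lists are not modified.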

IgorRodchenkov commented 6 years ago

SMPDB has already addressed the URI and xref.id issues above. Great.

IgorRodchenkov commented 6 years ago

More SMPDB observations and ideas:

  1. I would simply remove all the sub-pathways, clear the pathwayOrder property, and remove the PathwaySteps (they have no nextStep and usually one dummy interaction as stepProcess); that's OK. I could suggest a better way to model pathway steps (using the stepConversion, stepProcess and nextStep BioPAX properties, similar to what Reactome does); let's chat with the SMPDB team.
  2. Some Catalysis objects and their controllers are missing. For example, the Triosephosphate isomerase complex is on the SMP00040 view but is missing from the corresponding PW000146.owl file (unlike, e.g., Glucose-6-phosphate isomerase); it is also on the SMP00064 view.

IgorRodchenkov commented 5 years ago

(the following comment is from pathway-commons-dev emails, 29 Aug - 7 Sep, 2018)

... here is the list of problem pathways and their number of instances:
Cardiolipin Biosynthesis (S. cerevisiae)                     8958
Cardiolipin Biosynthesis Pathways (H. sapiens)               3277
Cardiolipin Biosynthesis (Barth Syndrome) (H. sapiens)      20016
De Novo Triacylglycerol Biosynthesis Pathways (H. sapiens)  22656
Phosphatidylcholine Biosynthesis Pathways (H. sapiens)        922
Phosphatidylcholine Biosynthesis (S. cerevisiae)              162
Phosphatidylethanolamine Biosynthesis (H. sapiens)            922
Phospholipid Biosynthesis (E. coli)                           910
Triacylglycerol Degradation (A. thaliana)                    1728
Triacylglycerol Metabolism (S. cerevisiae)                    322

So, let's just remove SMPDB from PC11 altogether.

cannin commented 5 years ago

That would be my vote.