PathwayCommons / cpath2

Biological pathway data integration and access platform (Pathway Commons)
http://www.pathwaycommons.org/pc2/
MIT License

A CV.term contains several names and identifiers #266

Closed IgorRodchenkov closed 7 years ago

IgorRodchenkov commented 7 years ago

Dylan (@d2fong), while analysing Pathway Commons pathway data in SBGN format, found a few abnormal terms, e.g. Gene Ontology IDs used as compartment labels instead of the corresponding ontology term (the compartment name). Example:

<glyph class="compartment" id="go_0005758">
    <label text="go:0005758"/>
</glyph>

He asks whether these values can be changed so that they are consistent.

So I dug into the code, reviewed the pathway data, and found things like this (a piece of Pathway Commons BioPAX data converted to JSON-LD):

{
    "@id" : "http://pathwaycommons.org/pc2/CellularLocationVocabulary_51abf0b80d2f83d8da3a904468119547",
    "@type" : "bp:CellularLocationVocabulary",
    "term" : [ "mitochondrial matrix", "Mitochondrial Matrix" ],
    "xref" : "http://pathwaycommons.org/pc2/UnificationXref_gene_ontology_GO_0005759"
  }, {
    "@id" : "http://pathwaycommons.org/pc2/CellularLocationVocabulary_8d4fc08becc119ce10c0387699eef670",
    "@type" : "bp:CellularLocationVocabulary",
    "term" : [ "Mitochondrial Intermembrane Space", "GO:0005758", "mitochondrial intermembrane space" ],
    "xref" : "http://pathwaycommons.org/pc2/UnificationXref_gene_ontology_GO_0005758"
  },

Either the authors added several values, including the ID (which is not right), to the original BioPAX controlled vocabulary terms, or this is due to our data integration/merging. It looks like the latter, because the original Reactome data has:

<bp:CellularLocationVocabulary rdf:ID="CellularLocationVocabulary70">
    <bp:term rdf:datatype="http://www.w3.org/2001/XMLSchema#string">mitochondrial intermembrane space</bp:term>
    <bp:xref rdf:resource="#UnificationXref39164" />
</bp:CellularLocationVocabulary>
<bp:UnificationXref rdf:ID="UnificationXref39164">
    <bp:db rdf:datatype="http://www.w3.org/2001/XMLSchema#string">GENE ONTOLOGY</bp:db>
    <bp:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">GO:0005758</bp:id>
</bp:UnificationXref>
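
For anyone who wants to repeat this check on other vocabularies, here is a minimal Python sketch (not part of cPath2 or Paxtools) that lists, for each CellularLocationVocabulary, its terms next to the IDs of its unification xrefs. The function name `terms_and_ids` is made up for illustration, the `rdf:RDF` wrapper and the BioPAX Level 3 namespace URI are assumptions because the fragment above does not show its namespace declarations, and the `rdf:datatype` attributes are dropped for brevity.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumption: the data is BioPAX Level 3; the fragment does not declare its namespaces.
BP = "http://www.biopax.org/release/biopax-level3.owl#"
NS = {"rdf": RDF, "bp": BP}

# The Reactome fragment from this issue, wrapped in an assumed rdf:RDF root.
SNIPPET = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:bp="{BP}">
<bp:CellularLocationVocabulary rdf:ID="CellularLocationVocabulary70">
    <bp:term>mitochondrial intermembrane space</bp:term>
    <bp:xref rdf:resource="#UnificationXref39164" />
</bp:CellularLocationVocabulary>
<bp:UnificationXref rdf:ID="UnificationXref39164">
    <bp:db>GENE ONTOLOGY</bp:db>
    <bp:id>GO:0005758</bp:id>
</bp:UnificationXref>
</rdf:RDF>"""


def terms_and_ids(xml_text):
    """Map each vocabulary's rdf:ID to (its terms, the ids of its xrefs)."""
    root = ET.fromstring(xml_text)
    # Index every UnificationXref by its rdf:ID so bp:xref links can be resolved.
    ids = {
        x.get(f"{{{RDF}}}ID"): x.findtext("bp:id", namespaces=NS)
        for x in root.findall("bp:UnificationXref", NS)
    }
    result = {}
    for cv in root.findall("bp:CellularLocationVocabulary", NS):
        terms = [t.text for t in cv.findall("bp:term", NS)]
        refs = [
            x.get(f"{{{RDF}}}resource", "").lstrip("#")
            for x in cv.findall("bp:xref", NS)
        ]
        result[cv.get(f"{{{RDF}}}ID")] = (terms, [ids.get(r) for r in refs])
    return result


summary = terms_and_ids(SNIPPET)
```

For this fragment the vocabulary carries a single term, and the GO accession appears only in the xref, which is consistent with the extra values being introduced later in the pipeline.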

So we should fix such CV term values either in the sbgn-converter (a quick fix that ignores ontology IDs) or, more appropriately, in the BioPAX Validator/Normalizer or in the cPath2 data merger.
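
To make the "quick fix" option concrete, here is a rough Python sketch of the idea only; the real converter is Java, and this is not its code. The names `ID_LIKE`, `clean_terms` and `compartment_label` are hypothetical, and the regular expression is an assumed definition of what counts as an ontology ID (a prefix, a colon, then digits, as in GO:0005758).

```python
import re

# Assumption: an ontology ID looks like "<prefix>:<digits>", e.g. "GO:0005758".
ID_LIKE = re.compile(r"^[A-Za-z][A-Za-z0-9_.]*:\d+$")


def clean_terms(terms):
    """Drop ID-like values and case-insensitive duplicates, keeping first-seen order."""
    seen, kept = set(), []
    for term in terms:
        value = term.strip()
        key = value.lower()
        if ID_LIKE.match(value) or key in seen:
            continue
        seen.add(key)
        kept.append(value)
    return kept


def compartment_label(terms, fallback_id=None):
    """Pick a display label: the first real name, else the fallback (e.g. the xref id)."""
    kept = clean_terms(terms)
    return kept[0] if kept else fallback_id


merged = ["Mitochondrial Intermembrane Space", "GO:0005758",
          "mitochondrial intermembrane space"]
label = compartment_label(merged, fallback_id="GO:0005758")
```

With the merged term list from the JSON-LD above, `label` is the compartment name and not the accession; the accession is used only when no name survives the filter. Doing the same cleanup in the Normalizer or the merger would fix the data for every consumer, not just the SBGN output.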

IgorRodchenkov commented 7 years ago

I think it's fixed (where possible) now; check this: http://beta.pathwaycommons.org/pc2/get?uri=http://identifiers.org/reactome/R-HSA-1268020&format=sbgn

d2fong commented 7 years ago

Thanks Igor