biopragmatics / pyobo

📛 A Python package for using ontologies, terminologies, and biomedical nomenclatures
https://pyobo.readthedocs.io
MIT License

coordinate with SIB on conversion of swisslipids #140

Open cmungall opened 1 year ago

cmungall commented 1 year ago

I am sitting next to @JervenBolleman, who is showing me his conversion of SwissLipids to OBO/OWL based on https://beta.sparql.swisslipids.org/. It would be great if we could agree on a canonical serialization.

cc @dosumis

dosumis commented 1 year ago

CC @rays22

JervenBolleman commented 1 year ago

This is of interest to us at SwissLipids as well.

cthoyt commented 1 year ago

@cmungall @JervenBolleman is there any possibility the SIB can host me for a week or two to work on this coordination / we can work together to get project funding for this? Otherwise, asking to change all of the useful practicalities of PyOBO to align externally is a pretty big ask. I like the idea, though

JervenBolleman commented 1 year ago

@cthoyt let's talk about this at Biocuration. In the meantime, maybe @cmungall can introduce us by email.

JervenBolleman commented 1 year ago

I just wanted to add an example of what comes out of running ROBOT convert on Swiss-Lipids.rdf:

OBO

[Term]
id: SLM:000003492
name: 1-(21Z,24Z,27Z,30Z-hexatriacontatetraenoyl)-2-tetradecanoyl-sn-glycero-3-phospho-L-serine
is_a: SLM:000000336 ! 1,2-diacyl-sn-glycero-3-phospho-L-serine
is_a: SLM:000114461 ! Phosphatidylserine (36:4/14:0)
relationship: BFO:0000051 SLM:000000825 ! tetradecanoate
relationship: BFO:0000051 SLM:000001232 ! (21Z,24Z,27Z,30Z)-hexatriacontatetraenoate
property_value: altLabel PS(36:4(21Z,24Z,27Z,30Z)/14:0) xsd:string
property_value: CHEMINF:000412 SLM:000003492 xsd:string
property_value: http://purl.obolibrary.org/obo/chebi/charge "-1" xsd:string
property_value: http://purl.obolibrary.org/obo/chebi/formula "C56H101NO10P" xsd:string
property_value: http://purl.obolibrary.org/obo/chebi/inchi "InChI=1S/C56H102NO10P/c1-3-5-7-9-11-13-15-16-17-18-19-20-21-22-23-24-25-26-27-28-29-30-31-32-33-34-35-36-38-39-41-43-45-47-54(58)64-49-52(50-65-68(62,63)66-51-53(57)56(60)61)67-55(59)48-46-44-42-40-37-14-12-10-8-6-4-2/h11,13,16-17,19-20,22-23,52-53H,3-10,12,14-15,18,21,24-51,57H2,1-2H3,(H,60,61)(H,62,63)/p-1/b13-11-,17-16-,20-19-,23-22-/t52-,53+/m1/s1" xsd:string
property_value: http://purl.obolibrary.org/obo/chebi/inchikey "GASNTXNDIBBTJQ-JRWRWSKCSA-M" xsd:string
property_value: http://purl.obolibrary.org/obo/chebi/smiles "CCCCCCCCCCCCCC(=O)O[C@H](COC(=O)CCCCCCCCCCCCCCCCCCC\\C=C/C\\C=C/C\\C=C/C\\C=C/CCCCC)COP([O-])(=O)OC[C@H]([NH3+])C([O-])=O" xsd:string
property_value: seeAlso https://rdf.metanetx.org/chem/MNXM253867
property_value: SLM:hasPart SLM:000000825
property_value: SLM:hasPart SLM:000001232
property_value: SLM:rank https://swisslipids.org/rdf/SLM_Isomeric_Subspecies

RDF

SLM:000003492 a owl:Class ;
  SLid: 'SLM:000003492' ;
  SLM:rank SLM:Isomeric_Subspecies ;
  rdfs:label "1-(21Z,24Z,27Z,30Z-hexatriacontatetraenoyl)-2-tetradecanoyl-sn-glycero-3-phospho-L-serine" ; 
  skos:altLabel "PS(36:4(21Z,24Z,27Z,30Z)/14:0)" ; 
  rdfs:subClassOf SLM:000000336 ;
  rdfs:subClassOf SLM:000114461 ;
  chebislash:inchi "InChI=1S/C56H102NO10P/c1-3-5-7-9-11-13-15-16-17-18-19-20-21-22-23-24-25-26-27-28-29-30-31-32-33-34-35-36-38-39-41-43-45-47-54(58)64-49-52(50-65-68(62,63)66-51-53(57)56(60)61)67-55(59)48-46-44-42-40-37-14-12-10-8-6-4-2/h11,13,16-17,19-20,22-23,52-53H,3-10,12,14-15,18,21,24-51,57H2,1-2H3,(H,60,61)(H,62,63)/p-1/b13-11-,17-16-,20-19-,23-22-/t52-,53+/m1/s1" ; 
  chebislash:inchikey "GASNTXNDIBBTJQ-JRWRWSKCSA-M" ; 
  rdfs:seeAlso metanetx:MNXM253867 ;
  chebislash:charge "-1" ; 
  rdfs:subClassOf [ 
   a owl:Restriction ;
   owl:onProperty haspart: ;
   owl:someValuesFrom SLM:000001232 ] ;
  rdfs:subClassOf [ 
   a owl:Restriction ;
   owl:onProperty haspart: ;
   owl:someValuesFrom SLM:000000825 ] ;
  SLM:hasPart SLM:000001232 ,
    SLM:000000825 ;
  chebislash:smiles '''CCCCCCCCCCCCCC(=O)O[C@H](COC(=O)CCCCCCCCCCCCCCCCCCC\\C=C/C\\C=C/C\\C=C/C\\C=C/CCCCC)COP([O-])(=O)OC[C@H]([NH3+])C([O-])=O''' ; 
  chebislash:formula "C56H101NO10P" .

I believe that it should be possible for the OBO to be nicer to look at.

SwissLipids is a proper extension of ChEBI, so the OBO has a bunch of stanzas like this:

[Term]
id: CHEBI:78102 ! 1-tetradecyl-sn-glycero-3-phosphocholine
equivalent_to: SLM:000001362 ! 1-O-tetradecyl-sn-glycero-3-phosphocholine

dosumis commented 1 year ago

Hi all - I think all we need is:

(a) a reliable source of a SwissLipids ontology file
(b) stable IRIs (@cthoyt & @JervenBolleman - do you think it would be possible to at least agree on short_form IDs?)
(c) A class hierarchy that links to CHEBI (which shouldn't be hard given that "SwissLipids is a proper extension of ChEBI"). CHEBI IRIs should follow OBO standard.

All other details can evolve without breaking our use case.

It looks like the SwissLipids release might already do all of this (perhaps apart from CHEBI IDs?). Maybe PyOBO can too, but I haven't seen the SwissLipids ontology product from PyOBO yet. @cthoyt would you be able to post a link or a recipe for generating it?

If we can get agreement on (b), we could switch between the PyOBO and SwissLipids versions if needed.
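To make point (b) concrete, here is a minimal sketch of how a CURIE, an IRI and a short_form ID relate to each other. The CHEBI base is the standard OBO PURL pattern. The SwissLipids base is an assumption, inferred from the rank IRI in the ROBOT output earlier in this thread, and is not an agreed standard; the function names are hypothetical.

```python
# Prefix map from CURIE prefix to IRI base.
PREFIX_MAP = {
    # Standard OBO PURL pattern for ChEBI classes.
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
    # ASSUMPTION: inferred from https://swisslipids.org/rdf/SLM_Isomeric_Subspecies
    # in the ROBOT output; the real class IRI base may differ.
    "SLM": "https://swisslipids.org/rdf/SLM_",
}


def expand(curie: str) -> str:
    """Expand a CURIE such as 'CHEBI:78102' into a full IRI."""
    prefix, local_id = curie.split(":", 1)
    return PREFIX_MAP[prefix] + local_id


def short_form(iri: str) -> str:
    """Return the short_form ID (e.g. 'CHEBI_78102') for a known IRI."""
    for prefix, base in PREFIX_MAP.items():
        if iri.startswith(base):
            return f"{prefix}_{iri[len(base):]}"
    raise ValueError(f"no known prefix for {iri}")


if __name__ == "__main__":
    print(expand("CHEBI:78102"))
    print(short_form(expand("SLM:000003492")))
```

With a shared map like this, two serializations that disagree on CURIE prefixes can still be compared at the IRI or short_form level.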

cthoyt commented 1 year ago

Conversion code: https://github.com/pyobo/pyobo/blob/main/src/pyobo/sources/slm.py
Artifacts: https://github.com/biopragmatics/obo-db-ingest/tree/main/export/swisslipid

PyOBO will always follow the Bioregistry standard, so if you want to talk about changing the prefix, we can discuss it on the tracker there: https://github.com/biopragmatics/bioregistry/issues

cmungall commented 1 year ago

I think we just need to agree on the ID prefix, and then the official SwissLipids file satisfies @dosumis's criteria (there are other things it would be good to iterate on, as per my original comment in this ticket, but this can come later).

Related to the ID discussion: should the ontology artefact be registered on OBO? Given that this is an extension to CHEBI and follows the same structure it seems reasonable. This might require having an obolibrary base to the PURLs, which may not be desirable to SIB (although there are some exceptions in OBO).

JervenBolleman commented 1 year ago

Using the swisslipids beta sparql endpoint and robot

curl -L -H 'accept:text/turtle' 'https://beta.sparql.swisslipids.org/sparql/' \
  --data 'query=PREFIX+foaf%3a+%3chttp%3a%2f%2fxmlns.com%2ffoaf%2f0.1%2f%3e%0d%0aCONSTRUCT+%7b%0d%0a++%3fs+%3fp+%3fo+.%0d%0a%7d+WHERE+%7b%0d%0a++GRAPH+%3chttps%3a%2f%2fsparql.swisslipids.org%2fswisslipids%3e%7b%0d%0a++++%3fs+%3fp+%3fo+.%0d%0a++++FILTER(!sameTerm(%3fp%2c+foaf%3adepiction))%0d%0a%09%7d%0d%0a%7d' \
  -o swisslipids.ttl

robot convert --input swisslipids.ttl  --output swisslipids.obo

We exclude the images (the foaf:depiction triples) because we don't want them in the OBO file. They are also large, which causes problems for ROBOT on normal hardware.
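The query in the curl command is form-encoded, which makes the filter hard to read. As a small sketch, the following decodes the `query=` value copied from that command using only the Python standard library; the helper name `decode_query` is hypothetical.

```python
from urllib.parse import unquote_plus

# Value of the `query=` form field, copied from the curl command above.
ENCODED_QUERY = (
    "PREFIX+foaf%3a+%3chttp%3a%2f%2fxmlns.com%2ffoaf%2f0.1%2f%3e%0d%0a"
    "CONSTRUCT+%7b%0d%0a++%3fs+%3fp+%3fo+.%0d%0a%7d+WHERE+%7b%0d%0a"
    "++GRAPH+%3chttps%3a%2f%2fsparql.swisslipids.org%2fswisslipids%3e%7b%0d%0a"
    "++++%3fs+%3fp+%3fo+.%0d%0a"
    "++++FILTER(!sameTerm(%3fp%2c+foaf%3adepiction))%0d%0a"
    "%09%7d%0d%0a%7d"
)


def decode_query(encoded: str) -> str:
    """Turn a form-encoded SPARQL query back into readable text."""
    # unquote_plus maps '+' to a space and %XX escapes to characters;
    # CRLF line endings are normalised to '\n' for display.
    return unquote_plus(encoded).replace("\r\n", "\n")


if __name__ == "__main__":
    print(decode_query(ENCODED_QUERY))
```

The decoded text is a CONSTRUCT query that copies every triple from the `https://sparql.swisslipids.org/swisslipids` graph except those whose predicate is `foaf:depiction`.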

@dosumis

a) We are looking into providing the OBO and TTL (or RDF) files, preconverted, at a stable location. This will take some time, as going from prototype to production always does.

b) IRIs are easier to agree on than CURIEs. I see no real reason why not, but this would be a bigger change that I would need to discuss with others, and I would need to gather feedback from SwissLipids users. At this point in time it would require a small postprocessing step on the ROBOT output, e.g.:

sed -i 's|SLM:|swisslipid:|g' swisslipids.obo

However, this might cause problems in the OBO-to-OWL conversion with ROBOT, so it needs investigation.
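One thing to check is that the sed expression rewrites every occurrence of `SLM:`, including property-like CURIEs such as `SLM:hasPart` and `SLM:rank` in the ROBOT output. A narrower variant, sketched below, only touches CURIEs whose local part is numeric. Whether the properties should keep the old prefix is an open question, so treat this as an illustration and not a recommendation; `rewrite_term_prefix` is a hypothetical helper.

```python
import re

# Matches term CURIEs such as SLM:000003492, but not property-like
# CURIEs such as SLM:hasPart or SLM:rank (ASSUMPTION: term IDs are
# always purely numeric after the prefix).
TERM_CURIE = re.compile(r"\bSLM:(\d+)\b")


def rewrite_term_prefix(obo_text: str, new_prefix: str = "swisslipid") -> str:
    """Replace the SLM prefix on numeric term IDs only."""
    return TERM_CURIE.sub(lambda m: f"{new_prefix}:{m.group(1)}", obo_text)


if __name__ == "__main__":
    sample = (
        "id: SLM:000003492\n"
        "property_value: SLM:hasPart SLM:000000825\n"
    )
    print(rewrite_term_prefix(sample))
```

The output of either approach would still need to be round-tripped through ROBOT to confirm that the OBO-to-OWL conversion resolves the new prefix.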

c) Already the case. See the stanza below, which I believe is valid OWL and OBO but unexpected for most OBO users:

[Term]
id: CHEBI:78102 ! 1-tetradecyl-sn-glycero-3-phosphocholine
equivalent_to: SLM:000001362 ! 1-O-tetradecyl-sn-glycero-3-phosphocholine
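For consumers who want the ChEBI links as a simple mapping, stanzas of this shape can be read with a few lines of Python. This is a minimal sketch that handles only the `id:` and `equivalent_to:` tags shown here, not a full OBO parser; `equivalences` is a hypothetical helper.

```python
def equivalences(obo_text: str) -> list[tuple[str, str]]:
    """Collect (term id, equivalent id) pairs from [Term] stanzas."""
    pairs = []
    current_id = None
    for raw_line in obo_text.splitlines():
        # Drop the trailing ' ! label' comment, if any.
        line = raw_line.split(" ! ", 1)[0].strip()
        if line == "[Term]":
            current_id = None  # a new stanza starts
        elif line.startswith("id: "):
            current_id = line[len("id: "):]
        elif line.startswith("equivalent_to: ") and current_id:
            pairs.append((current_id, line[len("equivalent_to: "):]))
    return pairs


if __name__ == "__main__":
    stanza = (
        "[Term]\n"
        "id: CHEBI:78102 ! 1-tetradecyl-sn-glycero-3-phosphocholine\n"
        "equivalent_to: SLM:000001362 ! 1-O-tetradecyl-sn-glycero-3-phosphocholine\n"
    )
    print(equivalences(stanza))
```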

@cmungall Regarding SwissLipids joining the OBO Foundry etc.: that is a different commitment that I will also need to discuss with the team. Let's move that off this issue.

matentzn commented 1 year ago

It seems that we also need to coordinate with a resource called LIPID MAPS, which seems to cover some relevant lipids that are not covered by SwissLipids.

https://www.lipidmaps.org/resources/sparql

Unfortunately, their SPARQL endpoint is down.