calipho-sib / cellosaurus

A knowledge resource on cell lines - From SIB CALIPHO group
https://www.cellosaurus.org
Creative Commons Attribution 4.0 International

owltools parser failure due to invalid characters in `xref` URL #2

Open vasiliy-bout opened 8 years ago

vasiliy-bout commented 8 years ago

Hello,

We use the ROBOT tool (https://github.com/ontodev/robot) to convert cellosaurus.obo into OWL format. We recently tried to update ROBOT to the newest version, but failed because the newest version cannot convert the current version of the cellosaurus.obo file. Another tool, owltools (https://github.com/owlcollab/owltools), also gives an error.

Both tools fail because line 224456 contains characters that are not valid in a URL:

xref: https://www.abmgood.com/Immortalized-Vascular-Endothelial-Cells-[EC-RF24]-T0003.html

We have tried two approaches to fix this issue in the source OBO. The first is to enclose the invalid URL in quotes:

xref: "https://www.abmgood.com/Immortalized-Vascular-Endothelial-Cells-[EC-RF24]-T0003.html"

and the second is to replace the invalid characters [ and ] with percent-encoded octets:

xref: https://www.abmgood.com/Immortalized-Vascular-Endothelial-Cells-%5BEC-RF24%5D-T0003.html
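For illustration only, the percent-encoding used in this second approach can be reproduced with Python's standard urllib.parse.quote. The function name encode_xref_url and the SAFE character set below are assumptions made for this sketch; neither ROBOT nor owltools provides them. "%" is kept in the safe set so that a URL that is already encoded is not encoded a second time.

```python
from urllib.parse import quote

# Characters left untouched (an assumption of this sketch): the RFC 3986
# reserved characters except "[" and "]", plus "%" so that existing
# percent-escapes are not encoded twice.
SAFE = ":/?#@!$&'()*+,;=%"


def encode_xref_url(url):
    """Percent-encode every character outside SAFE and the unreserved set."""
    return quote(url, safe=SAFE)


url = "https://www.abmgood.com/Immortalized-Vascular-Endothelial-Cells-[EC-RF24]-T0003.html"
print(encode_xref_url(url))
```

Because "%" is treated as safe, running the function on its own output leaves the URL unchanged.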

The first approach produces incorrect OWL output (perhaps because the quoted string is interpreted as a comment rather than as an external URL):

    <!-- http://purl.obolibrary.org/obo/TEMP#CVCL_AX74 -->

    <owl:Class rdf:about="http://purl.obolibrary.org/obo/TEMP#CVCL_AX74">
        <rdfs:subClassOf>
            <owl:Restriction>
                <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/TEMP#originate_from_same_individual_as"/>
                <owl:someValuesFrom rdf:resource="http://purl.obolibrary.org/obo/TEMP#CVCL_AX75"/>
            </owl:Restriction>
        </rdfs:subClassOf>
        <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string"></oboInOwl:hasDbXref>
        <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">BTO:BTO:0004188</oboInOwl:hasDbXref>
        <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">NCBI_TaxID:9606</oboInOwl:hasDbXref>
        <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">PubMed:7813621</oboInOwl:hasDbXref>
        <oboInOwl:hasOBONamespace rdf:datatype="http://www.w3.org/2001/XMLSchema#string">cellosaurus</oboInOwl:hasOBONamespace>
        <oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">EC-RF</oboInOwl:hasRelatedSynonym>
        <oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">ECRF</oboInOwl:hasRelatedSynonym>
        <oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">ECRF 24</oboInOwl:hasRelatedSynonym>
        <oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">ECRF24</oboInOwl:hasRelatedSynonym>
        <oboInOwl:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">CVCL_AX74</oboInOwl:id>
        <oboInOwl:inSubset rdf:resource="http://purl.obolibrary.org/obo/TEMP#Transformed_cell_line"/>
        <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string">&quot;Transfected with: UniProtKB; P00552; Transposon Tn5 neo. Transformant: HPV16 E6/E7 (pLXSN16).&quot;</rdfs:comment>
        <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">EC-RF24</rdfs:label>
    </owl:Class>
    <owl:Axiom>
        <owl:annotatedSource rdf:resource="http://purl.obolibrary.org/obo/TEMP#CVCL_AX74"/>
        <owl:annotatedProperty rdf:resource="http://www.geneontology.org/formats/oboInOwl#hasDbXref"/>
        <owl:annotatedTarget rdf:datatype="http://www.w3.org/2001/XMLSchema#string"></owl:annotatedTarget>
        <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">https://www.abmgood.com/Immortalized-Vascular-Endothelial-Cells-[EC-RF24]-T0003.html</rdfs:label>
    </owl:Axiom>

This Axiom has no value for owl:annotatedTarget (where the dbxref URL should be located) and instead carries the URL in a label tag, which is meaningless for the axiom.

And the second approach gives correct OWL output:

    <!-- http://purl.obolibrary.org/obo/TEMP#CVCL_AX74 -->

    <owl:Class rdf:about="http://purl.obolibrary.org/obo/TEMP#CVCL_AX74">
        <rdfs:subClassOf>
            <owl:Restriction>
                <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/TEMP#originate_from_same_individual_as"/>
                <owl:someValuesFrom rdf:resource="http://purl.obolibrary.org/obo/TEMP#CVCL_AX75"/>
            </owl:Restriction>
        </rdfs:subClassOf>
        <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">BTO:BTO:0004188</oboInOwl:hasDbXref>
        <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">NCBI_TaxID:9606</oboInOwl:hasDbXref>
        <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">PubMed:7813621</oboInOwl:hasDbXref>
        <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">https://www.abmgood.com/Immortalized-Vascular-Endothelial-Cells-%5BEC-RF24%5D-T0003.html</oboInOwl:hasDbXref>
        <oboInOwl:hasOBONamespace rdf:datatype="http://www.w3.org/2001/XMLSchema#string">cellosaurus</oboInOwl:hasOBONamespace>
        <oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">EC-RF</oboInOwl:hasRelatedSynonym>
        <oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">ECRF</oboInOwl:hasRelatedSynonym>
        <oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">ECRF 24</oboInOwl:hasRelatedSynonym>
        <oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">ECRF24</oboInOwl:hasRelatedSynonym>
        <oboInOwl:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">CVCL_AX74</oboInOwl:id>
        <oboInOwl:inSubset rdf:resource="http://purl.obolibrary.org/obo/TEMP#Transformed_cell_line"/>
        <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string">&quot;Transfected with: UniProtKB; P00552; Transposon Tn5 neo. Transformant: HPV16 E6/E7 (pLXSN16).&quot;</rdfs:comment>
        <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">EC-RF24</rdfs:label>
    </owl:Class>

So I suggest replacing the invalid characters in URLs with percent-encoded octets.

AmosBairoch commented 8 years ago

I agree with the proposed change; in release 18 this URL will use percent-encoded octets. Best, Amos




cmungall commented 6 years ago

Looks like this still needs to be done?

btw, @vasiliy-bout I recommend using robot over owltools now http://robot.obolibrary.org/

Also for conversion to OWL I recommend using CURIEs for IDs, s/CVCL_/CVCL:/g

vasiliy-bout commented 6 years ago

@cmungall , thanks for suggestions :)

Yes, it looks like some URLs are still invalid. The latest ROBOT tool fails to convert the current release. Try the following command line:

$ ./robot -vvv convert --input cellosaurus.obo --format owl --output cellosaurus.owl
2018-09-18 11:32:25,489 DEBUG org.obolibrary.robot.IOHelper - Loading ontology cellosaurus.obo with catalog file catalog-v001.xml
...
2018-09-18 11:32:26,529 WARN  org.obolibrary.oboformat.parser.OBOFormatParser - LINE: 152 accepting bad xref with spaces:<TKG:TKG 0732>  LINE:
xref: TKG:TKG 0732
2018-09-18 11:32:26,541 WARN  org.obolibrary.oboformat.parser.OBOFormatParser - LINE: 494 accepting bad xref with spaces:<IZSLER:BS CL 93>  LINE:
xref: IZSLER:BS CL 93
2018-09-18 11:32:26,565 WARN  org.obolibrary.oboformat.parser.OBOFormatParser - LINE: 962 accepting bad xref with spaces:<TKG:TKG 0614>  LINE:
xref: TKG:TKG 0614
2018-09-18 11:32:26,568 WARN  org.obolibrary.oboformat.parser.OBOFormatParser - LINE: 1186 accepting bad xref with spaces:<KCB:KCB 92029YJ>  LINE:
...
--------------------------------------------------------------------------------
Parser: org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser@277f7dd3
    Stack trace:
LINENO: 455206 - expected newline or end of line but found: ].pdf
LINE: xref: https://www.cmrb.eu/media/upload/arxius/blc/Documento_Deposito_Lineas_v32_ES[11-EM].pdf        org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser.parse(OBOFormatOWLAPIParser.java:60)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyFactoryImpl.loadOWLOntology(OWLOntologyFactoryImpl.java:197)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.actualParse(OWLOntologyManagerImpl.java:1098)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntology(OWLOntologyManagerImpl.java:1054)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntologyFromOntologyDocument(OWLOntologyManagerImpl.java:1004)
        uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntologyFromOntologyDocument(OWLOntologyManagerImpl.java:1015)
        org.obolibrary.robot.IOHelper.loadOntology(IOHelper.java:323)
        org.obolibrary.robot.IOHelper.loadOntology(IOHelper.java:209)
        org.obolibrary.robot.CommandLineHelper.getInputOntology(CommandLineHelper.java:381)
        org.obolibrary.robot.CommandLineHelper.updateInputOntology(CommandLineHelper.java:469)
LINENO: 455206 - expected newline or end of line but found: ].pdf
LINE: xref: https://www.cmrb.eu/media/upload/arxius/blc/Documento_Deposito_Lineas_v32_ES[11-EM].pdf        org.obolibrary.oboformat.parser.OBOFormatParser.error(OBOFormatParser.java:1501)
        org.obolibrary.oboformat.parser.OBOFormatParser.forceParseNlOrEof(OBOFormatParser.java:1339)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseEOL(OBOFormatParser.java:1307)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseTermFrameClauseEOL(OBOFormatParser.java:629)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseTermFrame(OBOFormatParser.java:601)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseEntityFrame(OBOFormatParser.java:566)
        org.obolibrary.oboformat.parser.OBOFormatParser.parseOBODoc(OBOFormatParser.java:381)
        org.obolibrary.oboformat.parser.OBOFormatParser.parse(OBOFormatParser.java:335)
        org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser.parse(OBOFormatOWLAPIParser.java:79)
        org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser.parse(OBOFormatOWLAPIParser.java:58)

There are many warnings saying that xref URLs are invalid because they contain spaces. RFC 3986 says that URL path components may only contain pchar characters:

   URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

   hier-part     = "//" authority path-abempty
                 / path-absolute
                 / path-rootless
                 / path-empty

   path-abempty  = *( "/" segment )
   path-absolute = "/" [ segment-nz *( "/" segment ) ]
   path-rootless = segment-nz *( "/" segment )
   path-empty    = 0<pchar>

   segment       = *pchar
   segment-nz    = 1*pchar

   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

   pct-encoded   = "%" HEXDIG HEXDIG

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

This means that, according to the RFC, all these URLs containing spaces are invalid, for example xref: TKG:TKG 0665 or xref: IZSLER:BS CL 37. But these URLs are reported only with WARN severity, probably because a space does not break a URL completely (some servers may still handle it correctly), since the space character is not in the gen-delims character set defined by the RFC:

   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"

But when ROBOT encounters a URL with unencoded characters that are not in pchar but are in gen-delims, it fails with an exception, because URLs like xref: https://www.cmrb.eu/media/upload/arxius/blc/Documento_Deposito_Lineas_v32_ES[11-EM].pdf are completely broken according to the RFC and should not be handled by web servers.
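The grammar quoted above can be turned into a small check. The following is a minimal Python sketch, not ROBOT's actual logic: the function name offending_path_chars is hypothetical, and it only reports which characters of a path are neither "/" nor a valid pchar.

```python
import re

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"   (RFC 3986)
PCHAR = re.compile(r"[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2}")

# gen-delims as listed in the RFC excerpt above.
GEN_DELIMS = set(":/?#[]@")


def offending_path_chars(path):
    """Return the characters of `path` that are neither "/" nor a pchar."""
    bad = []
    i = 0
    while i < len(path):
        if path[i] == "/":
            # "/" separates segments and is always allowed in a path.
            i += 1
            continue
        match = PCHAR.match(path, i)
        if match:
            # Skip one plain character or one "%XX" escape.
            i = match.end()
        else:
            bad.append(path[i])
            i += 1
    return bad


for path in ("/blc/Documento_ES[11-EM].pdf", "/TKG 0732", "/Cells-%5BEC-RF24%5D.html"):
    bad = offending_path_chars(path)
    # Flag the offenders that are also gen-delims (the case that stops the parser).
    fatal = [c for c in bad if c in GEN_DELIMS]
    print(path, bad, fatal)
```

In this sketch a space is reported as an offending character but not as a gen-delim, while "[" and "]" are reported as both, which matches the WARN versus exception distinction described above.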


I think this issue can be considered fixed only when the ROBOT tool successfully converts the cellosaurus.obo file into OWL format. Ideally, no WARN messages should be reported either, but this may be unachievable, because many people probably rely on these URLs with spaces even though they are invalid (it looks like these URLs contain decimal codes which may be interpreted by scripts or similar automated tools).

cmungall commented 6 years ago

Hmm, unfortunately the OWLAPI obo2owl translation is not likely to change for the immediate future.

I recommend escaping the characters with backslashes. Or better yet, don't use URLs as xrefs; use an rdfs:seeAlso annotation assertion instead.

QuarksToQuasars commented 5 years ago

Hi all. Do you happen to know whether any workarounds exist (e.g. some cleansing/sanitising via bash/perl string manipulation), or some way to disable certain properties in the OBO converter?
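One possible workaround, sketched here in Python rather than bash/perl, is to pre-process the OBO file and percent-encode the brackets in URL xrefs before conversion. This is a minimal sketch under stated assumptions: the names sanitize_line and sanitize_lines and the XREF_URL pattern are hypothetical, only xref lines whose value starts with http:// or https:// are touched, and only "[" and "]" are rewritten. The thread above shows that the percent-encoded form converted correctly for one URL; it has not been checked against every release.

```python
import re

# An xref clause whose value is an http(s) URL; anything after the first
# whitespace (description, trailing modifiers, comment) is kept as-is.
XREF_URL = re.compile(r"^(xref:\s*)(https?://\S+)(.*)$")


def sanitize_line(line):
    """Percent-encode "[" and "]" inside the URL of an xref line."""
    body = line.rstrip("\n")
    match = XREF_URL.match(body)
    if not match:
        return line  # not a URL xref: leave the line untouched
    prefix, url, rest = match.groups()
    url = url.replace("[", "%5B").replace("]", "%5D")
    newline = "\n" if line.endswith("\n") else ""
    return prefix + url + rest + newline


def sanitize_lines(lines):
    """Apply sanitize_line to every line of an iterable (e.g. an open file)."""
    for line in lines:
        yield sanitize_line(line)
```

To use it, open cellosaurus.obo for reading, write the result of sanitize_lines to a new file, and pass that new file to robot convert. Spaces in xrefs such as TKG:TKG 0732 are deliberately left alone, since they only trigger warnings.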