Open vasiliy-bout opened 8 years ago
I agree with the proposed change; in release 18 this URL will use percent-encoded octets. Best, Amos
On 14.06.2016 21:15, Vasiliy Bout wrote:
Hello,
We use the https://github.com/ontodev/robot tool to convert cellosaurus.obo into OWL format. We recently tried to update this tool to the newest version, but failed because the newest version cannot convert the current version of the cellosaurus.obo file. Another tool, owltools from https://github.com/owlcollab/owltools, also gives an error.
Both tools fail because line 224456 contains characters that cannot occur in a URL:
xref: https://www.abmgood.com/Immortalized-Vascular-Endothelial-Cells-[EC-RF24]-T0003.html
We have tried two approaches to fix this issue in the source OBO. The first is to enclose the invalid URL in quotes:
xref: "https://www.abmgood.com/Immortalized-Vascular-Endothelial-Cells-[EC-RF24]-T0003.html"
and the second is to encode the invalid characters [ and ] as percent-encoded octets:
xref: https://www.abmgood.com/Immortalized-Vascular-Endothelial-Cells-%5BEC-RF24%5D-T0003.html
The first approach gives strange, incorrect OWL output (maybe because the quoted string is interpreted as a comment and not as an external URL):
<!-- http://purl.obolibrary.org/obo/TEMP#CVCL_AX74 -->
<owl:Class rdf:about="http://purl.obolibrary.org/obo/TEMP#CVCL_AX74">
    <rdfs:subClassOf>
        <owl:Restriction>
            <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/TEMP#originate_from_same_individual_as"/>
            <owl:someValuesFrom rdf:resource="http://purl.obolibrary.org/obo/TEMP#CVCL_AX75"/>
        </owl:Restriction>
    </rdfs:subClassOf>
    <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string"></oboInOwl:hasDbXref>
    <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">BTO:BTO:0004188</oboInOwl:hasDbXref>
    <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">NCBI_TaxID:9606</oboInOwl:hasDbXref>
    <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">PubMed:7813621</oboInOwl:hasDbXref>
    <oboInOwl:hasOBONamespace rdf:datatype="http://www.w3.org/2001/XMLSchema#string">cellosaurus</oboInOwl:hasOBONamespace>
    <oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">EC-RF</oboInOwl:hasRelatedSynonym>
    <oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">ECRF</oboInOwl:hasRelatedSynonym>
    <oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">ECRF 24</oboInOwl:hasRelatedSynonym>
    <oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">ECRF24</oboInOwl:hasRelatedSynonym>
    <oboInOwl:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">CVCL_AX74</oboInOwl:id>
    <oboInOwl:inSubset rdf:resource="http://purl.obolibrary.org/obo/TEMP#Transformed_cell_line"/>
    <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string">"Transfected with: UniProtKB; P00552; Transposon Tn5 neo. Transformant: HPV16 E6/E7 (pLXSN16)."</rdfs:comment>
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">EC-RF24</rdfs:label>
</owl:Class>
<owl:Axiom>
    <owl:annotatedSource rdf:resource="http://purl.obolibrary.org/obo/TEMP#CVCL_AX74"/>
    <owl:annotatedProperty rdf:resource="http://www.geneontology.org/formats/oboInOwl#hasDbXref"/>
    <owl:annotatedTarget rdf:datatype="http://www.w3.org/2001/XMLSchema#string"></owl:annotatedTarget>
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">https://www.abmgood.com/Immortalized-Vascular-Endothelial-Cells-[EC-RF24]-T0003.html</rdfs:label>
</owl:Axiom>
This Axiom has no value for owl:annotatedTarget (where the dbxref URL should be located) and instead has a label tag, which is meaningless for the axiom.
The second approach gives correct OWL output:
<!-- http://purl.obolibrary.org/obo/TEMP#CVCL_AX74 -->
<owl:Class rdf:about="http://purl.obolibrary.org/obo/TEMP#CVCL_AX74">
    <rdfs:subClassOf>
        <owl:Restriction>
            <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/TEMP#originate_from_same_individual_as"/>
            <owl:someValuesFrom rdf:resource="http://purl.obolibrary.org/obo/TEMP#CVCL_AX75"/>
        </owl:Restriction>
    </rdfs:subClassOf>
    <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">BTO:BTO:0004188</oboInOwl:hasDbXref>
    <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">NCBI_TaxID:9606</oboInOwl:hasDbXref>
    <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">PubMed:7813621</oboInOwl:hasDbXref>
    <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">https://www.abmgood.com/Immortalized-Vascular-Endothelial-Cells-%5BEC-RF24%5D-T0003.html</oboInOwl:hasDbXref>
    <oboInOwl:hasOBONamespace rdf:datatype="http://www.w3.org/2001/XMLSchema#string">cellosaurus</oboInOwl:hasOBONamespace>
    <oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">EC-RF</oboInOwl:hasRelatedSynonym>
    <oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">ECRF</oboInOwl:hasRelatedSynonym>
    <oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">ECRF 24</oboInOwl:hasRelatedSynonym>
    <oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">ECRF24</oboInOwl:hasRelatedSynonym>
    <oboInOwl:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">CVCL_AX74</oboInOwl:id>
    <oboInOwl:inSubset rdf:resource="http://purl.obolibrary.org/obo/TEMP#Transformed_cell_line"/>
    <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string">"Transfected with: UniProtKB; P00552; Transposon Tn5 neo. Transformant: HPV16 E6/E7 (pLXSN16)."</rdfs:comment>
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">EC-RF24</rdfs:label>
</owl:Class>
So I suggest replacing the invalid characters in URLs with percent-encoded octets.
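The percent-encoding approach described above could be applied to the OBO source with a small script. The following is only a sketch in Python, not part of ROBOT or owltools: the helper name encode_xref_brackets is hypothetical, and it assumes that URL xrefs start with http:// or https:// and contain no whitespace.

```python
import re

# Assumption: a URL inside an xref line is a whitespace-free run that
# starts with http:// or https://.
_URL = re.compile(r"https?://\S+")


def encode_xref_brackets(line: str) -> str:
    """Percent-encode '[' and ']' inside URLs on an OBO xref line.

    Lines that are not xref clauses are returned unchanged.
    (Hypothetical helper, written for illustration only.)
    """
    if not line.startswith("xref:"):
        return line

    def _encode(match: re.Match) -> str:
        return match.group(0).replace("[", "%5B").replace("]", "%5D")

    return _URL.sub(_encode, line)
```

Applied to the offending xref line quoted above, this should produce the percent-encoded form shown for the second approach; xrefs that are not URLs, such as database identifiers, are left as they are.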
Professor and Director of the Dept. of Human Protein Sciences at the Faculty of Medicine of the University of Geneva Group leader at the SIB - Swiss Institute of Bioinformatics
Preferred email: ab@sib.swiss
Looks like this still needs to be done?
btw, @vasiliy-bout I recommend using robot over owltools now http://robot.obolibrary.org/
Also for conversion to OWL I recommend using CURIEs for IDs, s/CVCL_/CVCL:/g
@cmungall, thanks for the suggestions :)
Yes, it looks like some URLs are still invalid. The latest ROBOT tool fails to convert the current release. Try the following command line:
$ ./robot -vvv convert --input cellosaurus.obo --format owl --output cellosaurus.owl
2018-09-18 11:32:25,489 DEBUG org.obolibrary.robot.IOHelper - Loading ontology cellosaurus.obo with catalog file catalog-v001.xml
...
2018-09-18 11:32:26,529 WARN org.obolibrary.oboformat.parser.OBOFormatParser - LINE: 152 accepting bad xref with spaces:<TKG:TKG 0732> LINE:
xref: TKG:TKG 0732
2018-09-18 11:32:26,541 WARN org.obolibrary.oboformat.parser.OBOFormatParser - LINE: 494 accepting bad xref with spaces:<IZSLER:BS CL 93> LINE:
xref: IZSLER:BS CL 93
2018-09-18 11:32:26,565 WARN org.obolibrary.oboformat.parser.OBOFormatParser - LINE: 962 accepting bad xref with spaces:<TKG:TKG 0614> LINE:
xref: TKG:TKG 0614
2018-09-18 11:32:26,568 WARN org.obolibrary.oboformat.parser.OBOFormatParser - LINE: 1186 accepting bad xref with spaces:<KCB:KCB 92029YJ> LINE:
...
--------------------------------------------------------------------------------
Parser: org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser@277f7dd3
Stack trace:
LINENO: 455206 - expected newline or end of line but found: ].pdf
LINE: xref: https://www.cmrb.eu/media/upload/arxius/blc/Documento_Deposito_Lineas_v32_ES[11-EM].pdf
org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser.parse(OBOFormatOWLAPIParser.java:60)
uk.ac.manchester.cs.owl.owlapi.OWLOntologyFactoryImpl.loadOWLOntology(OWLOntologyFactoryImpl.java:197)
uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.actualParse(OWLOntologyManagerImpl.java:1098)
uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntology(OWLOntologyManagerImpl.java:1054)
uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntologyFromOntologyDocument(OWLOntologyManagerImpl.java:1004)
uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntologyFromOntologyDocument(OWLOntologyManagerImpl.java:1015)
org.obolibrary.robot.IOHelper.loadOntology(IOHelper.java:323)
org.obolibrary.robot.IOHelper.loadOntology(IOHelper.java:209)
org.obolibrary.robot.CommandLineHelper.getInputOntology(CommandLineHelper.java:381)
org.obolibrary.robot.CommandLineHelper.updateInputOntology(CommandLineHelper.java:469)
LINENO: 455206 - expected newline or end of line but found: ].pdf
LINE: xref: https://www.cmrb.eu/media/upload/arxius/blc/Documento_Deposito_Lineas_v32_ES[11-EM].pdf
org.obolibrary.oboformat.parser.OBOFormatParser.error(OBOFormatParser.java:1501)
org.obolibrary.oboformat.parser.OBOFormatParser.forceParseNlOrEof(OBOFormatParser.java:1339)
org.obolibrary.oboformat.parser.OBOFormatParser.parseEOL(OBOFormatParser.java:1307)
org.obolibrary.oboformat.parser.OBOFormatParser.parseTermFrameClauseEOL(OBOFormatParser.java:629)
org.obolibrary.oboformat.parser.OBOFormatParser.parseTermFrame(OBOFormatParser.java:601)
org.obolibrary.oboformat.parser.OBOFormatParser.parseEntityFrame(OBOFormatParser.java:566)
org.obolibrary.oboformat.parser.OBOFormatParser.parseOBODoc(OBOFormatParser.java:381)
org.obolibrary.oboformat.parser.OBOFormatParser.parse(OBOFormatParser.java:335)
org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser.parse(OBOFormatOWLAPIParser.java:79)
org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser.parse(OBOFormatOWLAPIParser.java:58)
There are a lot of warnings that many xref URLs are invalid because they contain spaces. RFC 3986 says that URL path components may only contain pchar characters:
URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part     = "//" authority path-abempty
              / path-absolute
              / path-rootless
              / path-empty
path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-rootless = segment-nz *( "/" segment )
path-empty    = 0<pchar>
segment       = *pchar
segment-nz    = 1*pchar
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded   = "%" HEXDIG HEXDIG
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="
This means that, according to the RFC, all these URLs containing spaces are invalid, for example xref: TKG:TKG 0665 or xref: IZSLER:BS CL 37. These URLs are reported with only WARN severity, probably because a space character does not completely break a URL and some servers may still handle it correctly; spaces are not in the gen-delims character set defined by the RFC:
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
But when ROBOT encounters a URL with unencoded characters that are not in pchar but are in gen-delims, it fails with an exception, because URLs like xref: https://www.cmrb.eu/media/upload/arxius/blc/Documento_Deposito_Lineas_v32_ES[11-EM].pdf are completely broken according to the RFC and should not be handled by web servers.
I think this issue can be considered fixed only when the ROBOT tool successfully converts the cellosaurus.obo file into OWL format. Ideally, no WARN message should be reported either, but this may be an unachievable goal, because probably a lot of people rely on these URLs with spaces even though they are invalid (it looks like these URLs contain some decimal codes that may be interpreted by scripts or similar automated tools).
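To make the pchar check above concrete, here is a minimal Python sketch that lists the characters in a URL path that the quoted RFC 3986 grammar does not allow. It only illustrates the grammar; it is not how ROBOT or the OWLAPI parser decides what to accept. The function names are hypothetical, and https://example.org/a b is a made-up URL used only to show the space case.

```python
import re
import string

# Character classes copied from the RFC 3986 grammar quoted above.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
PCHAR = UNRESERVED | SUB_DELIMS | set(":@")
GEN_DELIMS = set(":/?#[]@")
PCT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")


def url_path(url: str) -> str:
    """Return the path of a scheme://authority/path URL (hypothetical helper)."""
    _, sep, rest = url.partition("://")
    if not sep:
        raise ValueError("expected a URL of the form scheme://authority/path")
    rest = re.split(r"[?#]", rest, maxsplit=1)[0]  # drop query and fragment
    slash = rest.find("/")
    return rest[slash:] if slash != -1 else ""


def offending_path_chars(url: str) -> list:
    """List, in order of first appearance, the path characters outside pchar."""
    # Valid percent-encoded octets count as pchar, so remove them first.
    stripped = PCT_ENCODED.sub("", url_path(url))
    offenders = []
    for ch in stripped:
        if ch != "/" and ch not in PCHAR and ch not in offenders:
            offenders.append(ch)
    return offenders
```

For the cmrb.eu URL from the stack trace this returns the two brackets, both of which are in GEN_DELIMS, while the percent-encoded abmgood.com URL returns an empty list. For the made-up https://example.org/a b it returns a single space, which is outside pchar but not in GEN_DELIMS.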
Hmm, unfortunately the OWLAPI obo2owl translation is not likely to change for the immediate future.
I recommend backslashing the characters to escape them. Or better yet, don't use URLs as xrefs; use an rdfs:seeAlso annotation assertion instead.
Hi all. Do you happen to know whether there exist any workarounds (e.g. some cleansing/sanitising via some bash/perl string manipulation) or some means to disable certain properties from the obo converter?