PathwayCommons / cpath2

Biological pathway data integration and access platform (Pathway Commons)
http://www.pathwaycommons.org/pc2/
MIT License

Some PC2 URIs (rdf:ID) are invalid (the local part is not a valid NCName). #207

Closed IgorRodchenkov closed 9 years ago

IgorRodchenkov commented 9 years ago

Originally reported by Kyle Ellrott - "The file found at http://www.pathwaycommons.org/pc2/downloads/Pathway%20Commons.7.All.BIOPAX.owl.gz contains invalid RDF. The ID on line 83:

Contains invalid characters (the '+'). This pattern continues elsewhere in the file. Parsing the file fails under the Raptor RDF parser and the OpenRDF library." ... "Your files also fail under RDFLib:

rdflib.exceptions.ParserError: file:///data/pathway_tools/work/download/1.xml:29:0: rdf:ID value is not a valid NCName: RelationshipXref_nucleotide+genbank+identifier_45709210

I've written a cleaner script to deal with the issue: https://github.com/ucscCancer/pathway_tools/blob/master/scripts/clean_rdfxml.py"

Interestingly, the Jena API, the OWL API, the Protege editor, OpenLink Virtuoso, Java isUri(), etc., do not complain about such URIs. It looks like the URIs with '+' were generated by Paxtools's psimi-converter, which does not use biopax-validator's Normalizer.uri(..) method but applies URLEncoder.encode(localPartId) instead. So I'd change the psimi-converter to use id.replaceAll("[^-\\w]", "_") to fix this.
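
For illustration, here is a minimal sketch of the difference between the two approaches. It is not the actual psimi-converter code: the class name NcNameIdSketch, the helper names, and the sample input string are assumptions made up for this example. Only URLEncoder.encode and the replaceAll pattern come from the discussion above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class; not part of Paxtools or biopax-validator.
final class NcNameIdSketch {

    // Roughly what the converter appears to have been doing: URL-encoding
    // the local part, which turns each space into '+'.
    public static String urlEncode(String localPartId) {
        try {
            return URLEncoder.encode(localPartId, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    // The proposed replacement: keep '-', ASCII letters, digits and '_',
    // and replace every other character with '_'.
    public static String sanitize(String localPartId) {
        return localPartId.replaceAll("[^-\\w]", "_");
    }

    public static void main(String[] args) {
        // Assumed raw local part, chosen to resemble the ID in the error message.
        String raw = "nucleotide genbank identifier_45709210";
        System.out.println("RelationshipXref_" + urlEncode(raw));
        System.out.println("RelationshipXref_" + sanitize(raw));
    }
}
```

Note that the replacement alone does not guarantee a valid NCName, since an NCName also must not start with a digit or '-'. In the ID from the error message, the "RelationshipXref_" prefix already covers that.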
IgorRodchenkov commented 9 years ago

Fixed in the sources; this will not be an issue in the PC2 v8 db.