INCATools / semantic-sql

SQL and SQLite builds of OWL ontologies
https://incatools.github.io/semantic-sql/
BSD 3-Clause "New" or "Revised" License

Failed to build SQLite database from OWL file #59

Closed vdancik closed 1 year ago

vdancik commented 1 year ago

We are attempting to parse OWL files from Panther (http://data.pantherdb.org/ftp/pathway/current_release/BioPAX.tar.gz) using the Docker image docker:linkml/semantic-sql. When we tried to parse Thiamin_metabolism.owl, we got the following error:

> semsql make Thiamin_metabolism.db

cat /usr/local/lib/python3.8/dist-packages/semsql/builder//sql_schema/semsql.sql | sqlite3 .template.db.tmp && \
echo .exit | sqlite3 -echo .template.db.tmp -cmd ".mode csv" -cmd ".import /usr/local/lib/python3.8/dist-packages/semsql/builder//prefixes/prefixes.csv prefix" && \
mv .template.db.tmp .template.db
.exit
robot remove -i Thiamin_metabolism.owl --axioms "equivalent disjoint annotation" -o Thiamin_metabolism-min.owl
relation-graph --disable-owl-nothing true \
                       --ontology-file Thiamin_metabolism-min.owl\
                       --output-file Thiamin_metabolism-relation-graph.tsv.ttl.tmp \
                       --equivalence-as-subclass true \
               --output-subclasses true \
                       --reflexive-subclasses true && \
riot --out RDFXML Thiamin_metabolism-relation-graph.tsv.ttl.tmp > Thiamin_metabolism-relation-graph.tsv.owl.tmp && \
sqlite3 Thiamin_metabolism-relation-graph.tsv.db.tmp -cmd ".mode csv" ".import /usr/local/lib/python3.8/dist-packages/semsql/builder//prefixes/prefixes.csv prefix" && \
rdftab Thiamin_metabolism-relation-graph.tsv.db.tmp < Thiamin_metabolism-relation-graph.tsv.owl.tmp && \
sqlite3 Thiamin_metabolism-relation-graph.tsv.db.tmp -cmd '.separator "\t"' -cmd '.header on' "SELECT subject,predicate,object FROM statements " > Thiamin_metabolism-relation-graph.tsv.tmp && \
mv Thiamin_metabolism-relation-graph.tsv.tmp Thiamin_metabolism-relation-graph.tsv && \
rm Thiamin_metabolism-relation-graph.tsv.*.tmp
2022.10.18 20:50:32:527 zio-default-async-1 INFO org.renci.relationgraph.Main.program:57
    Running reasoner
2022.10.18 20:50:33:801 zio-default-async-1 INFO org.renci.relationgraph.Main.program:60
    Done running reasoner
2022.10.18 20:50:37:266 zio-default-async-1 INFO org.renci.relationgraph.Main.program:70
    Computed relations in 3.371s
cp .template.db Thiamin_metabolism.db.tmp && \
rdftab Thiamin_metabolism.db.tmp < Thiamin_metabolism.owl && \
sqlite3 Thiamin_metabolism.db.tmp -cmd '.separator "\t"' ".import Thiamin_metabolism-relation-graph.tsv entailed_edge" && \
gzip -f Thiamin_metabolism-relation-graph.tsv && \
cat /usr/local/lib/python3.8/dist-packages/semsql/builder//indexes/*.sql | sqlite3 Thiamin_metabolism.db.tmp && \
mv Thiamin_metabolism.db.tmp Thiamin_metabolism.db
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: RdfXmlError { kind: Other("eco_ECO:0000501_12 is not a valid rdf:ID value") }', src/main.rs:57:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
make: *** [/usr/local/lib/python3.8/dist-packages/semsql/builder/build.Makefile:49: Thiamin_metabolism.db] Error 101
rm Thiamin_metabolism-min.owl

cmungall commented 1 year ago

Hi @vdancik!

Just to set expectations: this repo is intended primarily for ontologies as SQL, but it should work for any OWL. I would like to add views for the BioPAX schema, just as we have views for OWL, allowing e.g. SELECT * FROM SmallMolecule, plus some standard composed joins. At the moment, though, queries like SELECT * FROM rdfs_label_statement will give empty results, because BioPAX uses bespoke properties.
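
To illustrate what such a view might look like, here is a minimal sketch. It is not part of semantic-sql: the SmallMolecule view definition, the ex: identifiers, and the assumption that BioPAX terms are contracted to a bp: prefix are all hypothetical, and the toy statements table only mimics the subject/predicate/object layout (plus an assumed value column for literals).

```python
import sqlite3

# In-memory toy database; a real build would produce Thiamin_metabolism.db.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE statements (subject TEXT, predicate TEXT, object TEXT, value TEXT);
INSERT INTO statements VALUES
  ('ex:mol1',  'rdf:type',       'bp:SmallMolecule',   NULL),
  ('ex:mol1',  'bp:displayName', NULL,                 'thiamin'),
  ('ex:xref1', 'rdf:type',       'bp:UnificationXref', NULL);

-- Hypothetical view: one row per individual typed as bp:SmallMolecule,
-- joined to its display name when one is present.
CREATE VIEW SmallMolecule AS
  SELECT s.subject AS id, n.value AS display_name
  FROM statements AS s
  LEFT JOIN statements AS n
    ON n.subject = s.subject AND n.predicate = 'bp:displayName'
  WHERE s.predicate = 'rdf:type' AND s.object = 'bp:SmallMolecule';
""")

small_molecules = conn.execute("SELECT * FROM SmallMolecule").fetchall()
print(small_molecules)  # [('ex:mol1', 'thiamin')]
```

A label-based view such as rdfs_label_statement would return nothing on this data, since the name is stored under bp:displayName rather than rdfs:label.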

Anyway, on to your problem. It looks like the OWL is not valid RDF/XML. The Rust parser is stricter than the OWLAPI or Jena, which let these kinds of problems slip by...

The problem is here:

<bp:UnificationXref rdf:ID="eco_ECO:0000501_12">
 <bp:id rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">ECO:0000501</bp:id>
 <bp:db rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">eco</bp:db>
</bp:UnificationXref>

This is invalid RDF/XML: an rdf:ID value must be an XML NCName, which cannot contain a colon, and eco_ECO is not defined as a prefix.
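
A quick way to find every offending value before running the build is sketched below. It is only an approximation: the NCNAME pattern covers the ASCII subset of the XML NCName rule, the scan only handles double-quoted attributes, and the invalid_rdf_ids helper is a name made up for this example.

```python
import re

# ASCII-only approximation of the XML NCName production: a letter or
# underscore, then letters, digits, '.', '-' or '_'. A colon is never allowed.
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")
RDF_ID = re.compile(r'rdf:ID\s*=\s*"([^"]*)"')

def invalid_rdf_ids(xml_text):
    """Return the rdf:ID values in xml_text that are not valid NCNames."""
    return [v for v in RDF_ID.findall(xml_text) if not NCNAME.match(v)]

snippet = '''<bp:UnificationXref rdf:ID="eco_ECO:0000501_12">
 <bp:id rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">ECO:0000501</bp:id>
</bp:UnificationXref>
<bp:UnificationXref rdf:ID="UnificationXref_3"/>'''

print(invalid_rdf_ids(snippet))  # ['eco_ECO:0000501_12']
```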

I tried a hacky fix using Apache Jena:

✗  riot --output rdfxml tests/input/Thiamin_metabolism.owl > tests/input/Thiamin_metabolism_fixed.owl
19:08:22 WARN  riot            :: [line: 117, col: 49] {W108} Not an XML Name: 'eco_ECO:0000501_12'
19:08:22 WARN  riot            :: [line: 824, col: 48] {W108} Not an XML Name: 'eco_ECO:0000314_4'
19:08:22 WARN  riot            :: [line: 1021, col: 49] {W108} Not an XML Name: 'eco_ECO:0000250_15'

But the output is still invalid, and the Rust rio parser refuses to process it.

I tried putting this in the header:

 xmlns:eco_ECO="http://purl.obolibrary.org/obo/ECO_"
 xmlns:ECO="http://purl.obolibrary.org/obo/ECO_"

But still no luck

I'll ask around and see if anyone has any fixes
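
Declaring extra namespaces cannot help here, because an rdf:ID value is a plain name rather than a prefixed QName, so prefix declarations are never applied to it. One possible workaround, sketched below and not tested against the full Panther release, is to preprocess the file and replace the colon in each rdf:ID, together with the matching "#..." fragment references. The sanitize_rdf_ids helper is hypothetical; it assumes double-quoted attributes and that the rewritten names do not collide with IDs already in the file.

```python
import re

# rdf:ID values, and local fragment references that point at them.
_ID = re.compile(r'(rdf:ID\s*=\s*")([^"]*)(")')
_REF = re.compile(r'(rdf:(?:resource|about)\s*=\s*"#)([^"]*)(")')

def _replace_colons(match):
    # Keep the attribute name and quotes, swap ':' for '_' in the value only.
    return match.group(1) + match.group(2).replace(":", "_") + match.group(3)

def sanitize_rdf_ids(xml_text):
    """Rewrite rdf:ID values and '#fragment' references so they contain no colon."""
    xml_text = _ID.sub(_replace_colons, xml_text)
    return _REF.sub(_replace_colons, xml_text)

sample = '''<bp:UnificationXref rdf:ID="eco_ECO:0000501_12">
 <bp:id rdf:datatype = "http://www.w3.org/2001/XMLSchema#string">ECO:0000501</bp:id>
</bp:UnificationXref>
<bp:xref rdf:resource="#eco_ECO:0000501_12" />'''

fixed = sanitize_rdf_ids(sample)
print(fixed)
```

Absolute IRIs (such as the rdf:datatype values) and literal content like ECO:0000501 are left untouched, since only rdf:ID values and references beginning with "#" are rewritten.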

vdancik commented 1 year ago

Thank you @cmungall for explaining what's going on and trying to find a fix.

cmungall commented 1 year ago

See: https://github.com/pantherdb/Helpdesk/issues/36