INCATools / ontology-development-kit

Bootstrap an OBO Library ontology
http://incatools.github.io/ontology-development-kit/
BSD 3-Clause "New" or "Revised" License

Travis-CI fails when converting ChEBI import #123

Closed jseager7 closed 4 years ago

jseager7 commented 5 years ago

We are using version 1.1.3 of the Ontology Development Kit to develop the PHIPO ontology, and we are importing terms from the ChEBI ontology. Unfortunately, when running tests with Travis-CI, ROBOT always fails when attempting to convert ChEBI, complaining that it cannot load a valid ontology from the IRI provided:

robot convert -I http://purl.obolibrary.org/obo/chebi.owl -o mirror/chebi.owl
INVALID ONTOLOGY IRI ERROR Could not load a valid ontology from IRI: http://purl.obolibrary.org/obo/chebi.owl
For details see: http://robot.obolibrary.org/errors#invalid-ontology-iri-error

(See here for the full Travis-CI log.)

The IRI in question is http://purl.obolibrary.org/obo/chebi.owl, and this resolves to ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl. The FTP address does link to a valid copy of ChEBI (which I've tested by opening it in Protege). The error also doesn't occur when we execute run.sh make test in our local virtual machine, which runs Docker through Vagrant; the error is specific to Travis-CI.

Because ChEBI is very large when uncompressed (462 MiB), our current guess is that Travis-CI is timing out during the download. This could be related to ChEBI being served over FTP instead of HTTP, which presumably means the server isn't applying any transparent (gzip) compression to the file (once compressed, ChEBI is only 29 MiB).

Is there anything the ODK can do to fix this problem?

matentzn commented 5 years ago

This is important, but unfortunately I am not sure how to fix it right now. Perhaps if we could pull in gzipped dependencies, we could reduce the load on the network?

@cmungall @rctauber @dosumis

suggestions welcome :)

cmungall commented 5 years ago

I like the idea of all ontologies providing a gzipped product / owlzip but this will take a bit of coordination. Maybe OBO central could do this for a small set of large ontologies.

Even if CHEBI were to be provided by HTTP, I'm never 100% sure about the rules for when gzip on the fly is used.

Anecdotally, I find it is often faster to download to disk using wget/curl then parse with robot/owlapi rather than using robot/owlapi with a remote URL.

We could potentially add an ontofox option to robot extract

jamesaoverton commented 5 years ago

We have a ROBOT Ontofox issue, but no progress to report yet: https://github.com/ontodev/robot/issues/170

matentzn commented 5 years ago

I think we need to make this issue a priority. I currently cannot use Travis for any of the phenotype ontologies. I will now try adding the wget intermediate step and see whether that helps. If that does not help: @cmungall are we allowed to host a chebi-gz somewhere (let's say GitHub for now), and make purl.obolib.org/obo/chebi/chebi.owl.gz point there?

matentzn commented 5 years ago

okay, so I tried simply replacing the mirror goal in the Makefile (src/ontology/Makefile) like this:

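# Download the mirror with wget instead of letting ROBOT fetch the remote IRI.
# Recipe lines must start with a tab character.
mirror/%.owl:
	mkdir -p mirror && wget -O $@ $(OBO)/$*.owl
	#$(ROBOT) convert -I $@ -o $@
.PRECIOUS: mirror/%.owl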

And it seems that, while it still took Travis 3 (!) retries to obtain a copy of ChEBI, it succeeded. This may not be sustainable, but for now, maybe you can try to make it work like that, @jseager7?
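
Since the download only succeeded after several retries, one option is to make the retrying explicit. The following is a minimal sketch, not part of the ODK: the function name retry and its arguments are hypothetical, and it assumes a shell that supports local (bash, dash, and similar).

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS attempts have failed.
# (Hypothetical helper for illustration; not an ODK or ROBOT feature.)
retry() {
    local max_attempts="$1"
    local delay="$2"
    shift 2
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "retry: giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "retry: attempt $attempt failed, waiting ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Possible usage in place of the bare wget call (values are arbitrary):
#   retry 5 10 wget -O mirror/chebi.owl http://purl.obolibrary.org/obo/chebi.owl
```

wget also has its own --tries option, which may be enough on its own; the wrapper is only useful if the same policy should apply to other commands as well.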

jseager7 commented 5 years ago

@matentzn That looks useful, thanks. Just a few questions:

matentzn commented 5 years ago

You can most certainly do that in your case! But the ODK should not give individual ontologies special treatment. Once we have a gz registry, we can do this for all ontologies at once. I think the ROBOT convert is redundant: I would have thought its only purpose is to ensure that all downloaded ontologies conform to the same syntax, but honestly, that is IMHO not necessary.

jseager7 commented 5 years ago

Ok, thanks very much for your help. I'll amend the Makefile and test a release on my fork of PHI-base/phipo. If that works, I might also try the more specific target, just to test if it's possible to load the gzip version of ChEBI.

matentzn commented 5 years ago

Awesome! Let us know how this goes for you! If just switching to the gzip download location works, we could even do something like: check whether purl/obo/x/x.owl.gz exists; if so, use that, else use .owl.
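
The "use .owl.gz if it exists, else .owl" idea could look roughly like the sketch below. Everything here is an assumption for illustration: fetch_owl, download_with_wget and the DOWNLOAD hook are hypothetical names, and the sketch simply appends .gz to the given URL, which may not match whatever PURL layout is eventually chosen.

```shell
# DOWNLOAD names a command that is called as: "$DOWNLOAD" URL OUTPUT_FILE
# (hypothetical hook so the downloader can be swapped out, e.g. for curl).
DOWNLOAD="${DOWNLOAD:-download_with_wget}"

download_with_wget() {
    wget -q -O "$2" "$1"
}

# fetch_owl BASE_URL OUTPUT_FILE
# Tries BASE_URL.gz first and decompresses it into OUTPUT_FILE;
# if that fails for any reason, falls back to downloading BASE_URL directly.
fetch_owl() {
    local url="$1"
    local out="$2"
    local tmp="${out}.gz.tmp"
    if "$DOWNLOAD" "${url}.gz" "$tmp" && gzip -dc "$tmp" > "$out"; then
        rm -f "$tmp"
        return 0
    fi
    # wget -O leaves an empty file behind on failure, so clean up first.
    rm -f "$tmp"
    "$DOWNLOAD" "$url" "$out"
}

# Possible usage, assuming a gzipped copy were published next to the .owl file:
#   fetch_owl http://purl.obolibrary.org/obo/chebi.owl mirror/chebi.owl
```

Trying the download and falling back on failure avoids a separate existence check, at the cost of one failed request for ontologies that have no gzipped copy.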

jseager7 commented 5 years ago

Good news: the build is passing now that I've amended the Makefile with your fix. The change is here: https://github.com/jseager7/phipo/commit/794b5cdcbf7e25e959696a163a6455c834693081

ChEBI took about 70 seconds to download, not sure how many retries it took in my case:

mkdir -p mirror && wget -O mirror/chebi.owl http://purl.obolibrary.org/obo/chebi.owl
--2018-11-15 14:01:51--  http://purl.obolibrary.org/obo/chebi.owl
Resolving purl.obolibrary.org (purl.obolibrary.org)... 52.3.123.63
Connecting to purl.obolibrary.org (purl.obolibrary.org)|52.3.123.63|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl [following]
--2018-11-15 14:01:52--  ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl
           => ‘mirror/chebi.owl’
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.192.4
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.192.4|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/databases/chebi/ontology ... done.
==> SIZE chebi.owl ... 484882277
==> PASV ... done.    ==> RETR chebi.owl ... done.
Length: 484882277 (462M) (unauthoritative)
chebi.owl           100%[===================>] 462.42M  8.82MB/s    in 70s     
2018-11-15 14:03:03 (6.63 MB/s) - ‘mirror/chebi.owl’ saved [484882277]
cat seed.txt imports/chebi_terms.txt | sort | uniq >  imports/chebi_terms_combined.txt
robot extract -i mirror/chebi.owl -T imports/chebi_terms_combined.txt --method BOT -O http://purl.obolibrary.org/obo/phipo/imports/chebi_import.owl -o imports/chebi_import.owl

See Build #8 for the full log.

I'll try to load the gzip version next and see what the time difference is like.

matentzn commented 5 years ago

Awesome!

jseager7 commented 5 years ago

Late reply, but the gzip version loaded in about 2 seconds, so if the ODK were able to default to gzipped copies of the ontologies (once they have a PURL), it would certainly speed up the build. I'm not sure how much time it would save relative to the whole build though, because converting and processing the files with ROBOT seems to take longer than the download.

I had a look at adding an override for ChEBI, but as far as I can see, it would mean changing the mirror/%.owl rule everywhere to point to a new higher-level rule (something like mirrors) that would run both the wildcard rule (mirror/%.owl) and a specific rule for ChEBI (mirror/chebi.owl). I decided this probably wasn't worth doing, since I didn't want the maintenance problem of diverging from the ODK source.

cmungall commented 5 years ago

This kind of configurability will be easier with #126

matentzn commented 4 years ago

This has been fixed by enabling gzipped imports on huge ontologies like CHEBI.