Closed jseager7 closed 4 years ago
This is important, but unfortunately right now, I am not sure how to fix it. Perhaps if we could pull in gzipped dependencies, maybe we could reduce the stress on the network?
@cmungall @rctauber @dosumis
suggestions welcome :)
I like the idea of all ontologies providing a gzipped product / owlzip but this will take a bit of coordination. Maybe OBO central could do this for a small set of large ontologies.
Even if CHEBI were to be provided by HTTP, I'm never 100% sure about the rules for when gzip on the fly is used.
Anecdotally, I find it is often faster to download to disk using wget/curl then parse with robot/owlapi rather than using robot/owlapi with a remote URL.
We could potentially add an ontofox option to robot extract
We have a ROBOT Ontofox issue, but no progress to report yet: https://github.com/ontodev/robot/issues/170
I think we need to make this issue a priority. I can no longer use Travis for any of the phenotype ontologies. I will now try adding the wget intermediate step and see whether that helps. If that does not help: @cmungall are we allowed to host a chebi-gz somewhere (let's say GitHub for now), and make purl.obolib.org/obo/chebi/chebi.owl.gz point there?
okay, so I tried simply replacing the mirror goal in the Makefile (src/ontology/Makefile) like this:
mirror/%.owl:
	mkdir -p mirror && wget -O $@ $(OBO)/$*.owl
	#$(ROBOT) convert -I $@ -o $@
.PRECIOUS: mirror/%.owl
And it seems that, while it still took travis 3 (!) retries to obtain a copy of Chebi, it succeeded. This may not be sustainable, but for now, maybe you can try to make it work like that @jseager7?
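If the retries should be explicit rather than left to Travis, one option is a small wrapper around the download command. This is only a sketch, assuming a POSIX shell; the `retry` helper, its arguments, and the attempt count and delay in the usage comment are hypothetical and not part of the ODK or of the Makefile rule above.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while true; do
        # Success on any attempt ends the loop immediately.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Example usage (not executed here): up to 5 attempts, 10 seconds apart.
# retry 5 10 wget -O mirror/chebi.owl http://purl.obolibrary.org/obo/chebi.owl
```

wget also has its own `--tries` option, so the wrapper is mainly useful when the whole command, not just a single connection, needs to be repeated.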
@matentzn That looks useful, thanks. Just a few questions:
Does the Travis build fail if wget isn't able to fetch ChEBI, or does it tolerate a few retries?
Could we restrict these extra steps to ChEBI only, by using a more specific target? Something like:
mirror/chebi.owl:
	mkdir -p mirror && wget -O $@.gz ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz && gunzip -f $@.gz
	# do we need to robot convert?
.PRECIOUS: mirror/chebi.owl
(Apologies for any syntax omissions, I'm not experienced with Make)
You can most certainly do that in your case! But the ODK should not treat individual ontologies specially. Once we have a gz registry we can do this for all ontologies at once. I think the ROBOT convert step is redundant: I would have thought its only purpose is to ensure that all downloaded ontologies conform to the same syntax, but honestly, that is IMHO not necessary.
Ok, thanks very much for your help. I'll amend the Makefile and test a release on my fork of PHI-base/phipo. If that works, I might also try the more specific target, just to test if it's possible to load the gzip version of ChEBI.
Awesome! Let us know how this goes for you! If just switching to gzip download location would work... We could even do something like: check whether purl/obo/x/x.owl.gz exists, if so, use that, else use .owl.
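The existence check described above could look roughly like the following shell sketch. The helper names `url_exists` and `pick_mirror_url` are hypothetical, and the `.owl.gz` location simply follows the purl/obo/x/x.owl.gz pattern suggested in the comment; no such PURLs are assumed to exist yet.

```shell
#!/bin/sh
# Base PURL; can be overridden from the environment.
OBO=${OBO:-http://purl.obolibrary.org/obo}

# url_exists URL
# Uses wget's --spider mode to check a URL without downloading the body.
url_exists() {
    wget -q --spider "$1"
}

# pick_mirror_url ONTOLOGY_ID [PROBE_COMMAND]
# Prints the gzipped URL if PROBE_COMMAND succeeds on it, else the plain .owl URL.
# PROBE_COMMAND defaults to url_exists; passing it in makes the logic easy to test.
pick_mirror_url() {
    id=$1
    probe=${2:-url_exists}
    gz_url="$OBO/$id/$id.owl.gz"   # hypothetical location, not an existing PURL
    if "$probe" "$gz_url"; then
        echo "$gz_url"
    else
        echo "$OBO/$id.owl"
    fi
}

# Example usage (not executed here):
# wget -O mirror/chebi.download "$(pick_mirror_url chebi)"
```

The caller still has to decompress the result when the gzipped URL is chosen, so a real rule would need to branch on the returned file extension.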
Good news: the build is passing now that I've amended the Makefile with your fix. The change is here: https://github.com/jseager7/phipo/commit/794b5cdcbf7e25e959696a163a6455c834693081
ChEBI took about 70 seconds to download, not sure how many retries it took in my case:
mkdir -p mirror && wget -O mirror/chebi.owl http://purl.obolibrary.org/obo/chebi.owl
--2018-11-15 14:01:51-- http://purl.obolibrary.org/obo/chebi.owl
Resolving purl.obolibrary.org (purl.obolibrary.org)... 52.3.123.63
Connecting to purl.obolibrary.org (purl.obolibrary.org)|52.3.123.63|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl [following]
--2018-11-15 14:01:52-- ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl
=> ‘mirror/chebi.owl’
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.192.4
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.192.4|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/databases/chebi/ontology ... done.
==> SIZE chebi.owl ... 484882277
==> PASV ... done. ==> RETR chebi.owl ... done.
Length: 484882277 (462M) (unauthoritative)
chebi.owl 100%[===================>] 462.42M 8.82MB/s in 70s
2018-11-15 14:03:03 (6.63 MB/s) - ‘mirror/chebi.owl’ saved [484882277]
cat seed.txt imports/chebi_terms.txt | sort | uniq > imports/chebi_terms_combined.txt
robot extract -i mirror/chebi.owl -T imports/chebi_terms_combined.txt --method BOT -O http://purl.obolibrary.org/obo/phipo/imports/chebi_import.owl -o imports/chebi_import.owl
See Build #8 for the full log.
I'll try to load the gzip version next and see what the time difference is like.
Awesome!
Late reply, but the gzip version loaded in about 2 seconds, so if the ODK was able to default to gzipped copies of the ontologies (once they have a PURL), it would certainly speed up the build. I'm not sure how much time it would save relative to the whole build though, because converting and processing the files with ROBOT seems to take longer.
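For anyone wanting to try the gzipped copy by hand, the unpacking step might be sketched like this. The `unpack_mirror` helper is hypothetical and assumes the `.gz` file has already been downloaded to disk; it is not how the ODK handles mirrors.

```shell
#!/bin/sh
# unpack_mirror GZ_FILE TARGET
# Hypothetical helper: verifies GZ_FILE and decompresses it to TARGET,
# leaving GZ_FILE itself in place.
unpack_mirror() {
    gz_file=$1
    target=$2
    # gzip -t tests archive integrity, which catches truncated downloads early.
    gzip -t "$gz_file" || return 1
    mkdir -p "$(dirname "$target")"
    # Decompress to a temporary name first, so a failed run never leaves
    # a partial file under the target name.
    gzip -dc "$gz_file" > "$target.tmp" && mv "$target.tmp" "$target"
}

# Example usage (not executed here), using the FTP location mentioned earlier:
# wget -O mirror/chebi.owl.gz ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz
# unpack_mirror mirror/chebi.owl.gz mirror/chebi.owl
```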
I had a look at adding an override for ChEBI, but as far as I can see, it would mean changing the mirror/%.owl rule everywhere to point to a new higher-level rule (something like mirrors) that would run both the wildcard rule (mirror/%.owl) and a specific rule for ChEBI (mirror/chebi.owl). I decided this probably wasn't worth doing, since I didn't want the maintenance problem of diverging from the ODK source.
This kind of configurability will be easier with #126
This has been fixed by enabling gzipped imports on huge ontologies like CHEBI.
We are using version 1.1.3 of the Ontology Development Kit to develop the PHIPO ontology, and we are importing terms from the ChEBI ontology. Unfortunately, when running tests with Travis-CI, ROBOT always fails when attempting to convert ChEBI, complaining that it cannot load a valid ontology from the IRI provided:
(See here for the full Travis-CI log.)
The IRI in question is http://purl.obolibrary.org/obo/chebi.owl, and this resolves to ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl. The FTP address does link to a valid copy of ChEBI (which I've tested by opening it in Protege). The error also doesn't occur when we run run.sh make test in our local virtual machine, which runs Docker through Vagrant – the error is specific to Travis-CI.
Our current thinking is that, because ChEBI is very large when uncompressed (462 MiB), Travis-CI is timing out during the download. This could be related to the fact that ChEBI is provided over FTP instead of HTTP, which presumably means the server isn't performing any transparent (gzip) compression on the file (once compressed, ChEBI is only 29 MiB).
Is there anything the ODK can do to fix this problem?