cURL timeout on very large ontologies

jseager7 commented 1 year ago

When running build.sh I've noticed that creating mirrors of some very large ontologies fails because cURL can't download the ontology file before hitting the timeout limit (200 seconds).

Here's an example of the error:

curl: (28) Operation timed out after 199788 milliseconds with 674767788 out of 703555307 bytes received
Warning: Problem : timeout. Will retry in 1 seconds. 4 retries left.
Throwing away 674767788 bytes

This mainly affects the ChEBI OWL file, which is huge (~670 MB). I can download about 630 MB before timing out.
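As a rough check, the figures in the curl error above indicate how long a full download would take at the observed rate. This is a minimal sketch that assumes the transfer rate stays constant; `estimate_seconds` is an illustrative helper, not part of ODK or curl.

```shell
#!/bin/sh
# Estimate the seconds needed for a full download from one timed-out attempt.
# Assumption: the transfer rate stays roughly constant for the whole file.
estimate_seconds() {
  # $1 = total bytes, $2 = bytes received, $3 = elapsed milliseconds
  denom=$(( $2 * 1000 ))
  # Integer ceiling of (total * elapsed_ms) / (received * 1000)
  echo $(( ($1 * $3 + denom - 1) / denom ))
}

# Figures taken from the two curl (28) errors quoted in this thread.
estimate_seconds 703555307 674767788 199788   # prints 209
estimate_seconds 703555307 532725848 199777   # prints 264
```

Both estimates exceed the 200-second limit, which is consistent with the download failing close to the end rather than stalling.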

I tried to override the curl command in phipo.Makefile to set a higher timeout limit (--max-time 600) by copying these lines into the Makefile:

## ONTOLOGY: chebi
.PHONY: mirror-chebi
.PRECIOUS: $(MIRRORDIR)/chebi.owl
mirror-chebi: | $(TMPDIR)
    if [ $(MIR) = true ] && [ $(IMP) = true ]; then curl -L $(URIBASE)/chebi.owl --create-dirs -o $(MIRRORDIR)/chebi.owl --retry 4 --max-time 600 &&\
        $(ROBOT) convert -i $(MIRRORDIR)/chebi.owl -o $@.tmp.owl &&\
        mv $@.tmp.owl $(TMPDIR)/$@.owl; fi

But this doesn't seem to have any effect. The console still reports that approximately 3 minutes are left once the download starts:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   328    0   328    0     0   1566      0 --:--:-- --:--:-- --:--:--  1569
 12  670M   12 86.3M    0     0  3183k      0  0:03:35  0:00:27  0:03:08 3207k

Is there anything I'm missing?

matentzn commented 1 year ago

First of all: @jseager7 GREAT to hear from you!

Secondly:

Nearly everyone has now switched to the CHEBI and PRO subsets managed by the OBO Phenotype community, e.g. https://github.com/obophenotype/human-phenotype-ontology/blob/master/src/ontology/hp-odk.yaml#L48 This speeds up the process considerably and reduces memory consumption.
CHEBI is known to be flaky, and sometimes trying again the next day helps.
You can set the download parameters in ODK using mirror_retry_download and mirror_max_time_download in the imports: section (these apply to all ontologies at once, not on a per-ontology basis).

Let me know if these help!

jseager7 commented 1 year ago

@matentzn Thanks for the suggestions. Unfortunately, none of them have solved the problem.

I tried adding the ChEBI slim to phipo-odk.yaml but build.sh just tried to download the 670M file again.

My connection doesn't seem flaky, since the data is being transferred at a steady rate; I presume it's just not fast enough to finish before the timeout.

I tried setting mirror_max_time_download to 600 and build.sh finished without timing out, but then it timed out on prepare_release.sh instead. See below for the console log.

if [ true  = true ] && [ true  = true ]; then curl -L http://purl.obolibrary.org/obo/chebi.owl --create-dirs -o mirror/chebi.owl --retry 4 --max-time 200 &&\
        robot --catalog catalog-v001.xml convert -i mirror/chebi.owl -o mirror-chebi.tmp.owl &&\
        mv mirror-chebi.tmp.owl tmp/mirror-chebi.owl; fi
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   328    0   328    0     0   1426      0 --:--:-- --:--:-- --:--:--  1426
 75  670M   75  508M    0     0  2601k      0  0:04:24  0:03:20  0:01:04 2649k
curl: (28) Operation timed out after 199777 milliseconds with 532725848 out of 703555307 bytes received
Warning: Problem : timeout. Will retry in 1 seconds. 4 retries left.
Throwing away 532725848 bytes

I still have some leftover stuff in phipo.Makefile that was probably trying to solve problems with ChEBI. Maybe this is now causing problems:

imports/chebi_import.owl: mirror/chebi.owl imports/chebi_terms_combined.txt
    if [ $(IMP) = true ]; then $(ROBOT) extract -i $< -T imports/chebi_terms_combined.txt --force true --method BOT \
        annotate --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) --output $@.tmp.owl && mv $@.tmp.owl $@; fi

.PRECIOUS: imports/chebi_import.owl

I don't really want to keep running prepare_release.sh since it seems to be downloading a mirror of every ontology every time, which is wasting time and data (possibly related: https://github.com/INCATools/ontology-development-kit/issues/863). It's also throwing away hundreds of megabytes of downloaded data every time ChEBI times out.
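One way to avoid discarding partial data is curl's `-C -` option, which continues from the size of the file already on disk. The sketch below is not how ODK's generated Makefile works. It assumes the server honours HTTP Range requests, and `fetch_resumable`, `CURL_BIN`, and `RETRY_DELAY` are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Resume a partial download across attempts instead of restarting at byte zero.
# Assumptions: the server supports Range requests; fetch_resumable, CURL_BIN,
# and RETRY_DELAY are illustrative names, not ODK settings.
fetch_resumable() {
  url=$1
  dest=$2
  max_tries=${3:-5}
  try=1
  while [ "$try" -le "$max_tries" ]; do
    # -C - continues from the existing file size; --max-time still bounds
    # each attempt, but bytes already written to $dest are kept.
    if "${CURL_BIN:-curl}" -L --fail --create-dirs -C - \
         --max-time 200 -o "$dest" "$url"; then
      return 0
    fi
    echo "attempt $try failed; will resume" >&2
    try=$((try + 1))
    sleep "${RETRY_DELAY:-1}"
  done
  return 1
}

# Example call (not executed here):
#   fetch_resumable http://purl.obolibrary.org/obo/chebi.owl mirror/chebi.owl
```

The loop replaces `--retry`, because the log above shows curl's own retry discarding the bytes already received before starting again.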

matentzn commented 1 year ago

Did you run the update_repo workflow? Can you point me to a PR?

jseager7 commented 1 year ago

I updated to ODK v1.4.1 today with the update_repo workflow. I think I did this before trying the release.

There's no PR yet, but you can check the release branch on our repo. Here's the diff:

https://github.com/PHI-base/phipo/compare/release

jseager7 commented 1 year ago

If you want me to make a PR so we can collaborate on fixing this, I'm happy to do that.

matentzn commented 1 year ago

Yes, a draft PR would be better.

balhoff commented 1 year ago

@jseager7 for CHEBI in particular you could use http://purl.obolibrary.org/obo/chebi.owl.gz, it's only 46 MB.
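Outside of ODK, fetching the compressed file by hand could look like the sketch below. It downloads the `.gz`, decompresses it to a temporary file, and only then moves it into place. `fetch_gz_mirror` and `CURL_BIN` are hypothetical names, and whether ODK's own mirror step accepts a `.gz` URL directly is not something this example assumes.

```shell
#!/bin/sh
# Download a gzip-compressed ontology and unpack it next to the mirror path.
# Assumptions: fetch_gz_mirror and CURL_BIN are illustrative names; this is a
# manual workaround sketch, not the recipe ODK generates.
fetch_gz_mirror() {
  url=$1
  dest=$2
  "${CURL_BIN:-curl}" -L --fail --create-dirs --retry 4 --max-time 200 \
      -o "$dest.gz" "$url" &&
    # Decompress to a temporary name so a failed gunzip never leaves a
    # truncated file at the final path.
    gunzip -c "$dest.gz" > "$dest.tmp" &&
    mv "$dest.tmp" "$dest"
}

# Example call (not executed here):
#   fetch_gz_mirror http://purl.obolibrary.org/obo/chebi.owl.gz mirror/chebi.owl
```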

jseager7 commented 1 year ago

I ended up fixing this by using the slim version of ChEBI. The fix didn't work at first because I forgot to run the command to update the Makefile.

Thanks @balhoff for the suggestion about using compressed versions of the ontologies. I might use this for other ontologies.

To use the compressed versions, do you just add the PURLs to the ODK YAML file as a mirror_from property? For example:

    - id: chebi
      make_base: TRUE
      mirror_from: http://purl.obolibrary.org/obo/chebi.owl.gz

INCATools / ontology-development-kit

cURL timeout on very large ontologies #884