geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

Slow and unstable downloads from EBI upstreams increasing pipeline failures #305

Closed kltm closed 1 year ago

kltm commented 1 year ago

Many downloads are slow to the point of a few megabytes taking hours. Figure out the source/reason and fix. More details to come.

kltm commented 1 year ago

Going through the logs, we seem to have the following slow downloads:

Trying locally, chebi.owl seems to be fine. goa_uniprot_all.gaf.gz is very very slow locally. Interestingly, for goa_uniprot_all.gaf.gz, switching to FTP gives me very fast (normal) speeds.

Looking through the logs, goa_uniprot_all.gaf.gz seems to have been getting worse since around the 14th; chebi.owl since the 25th.

kltm commented 1 year ago

Testing again, I can get good times for goa_uniprot_all.gaf.gz by switching to http or ftp, just not https. That would explain why we took a dive since https://github.com/geneontology/go-site/pull/1914/files .

@cmungall @balhoff Putting chebi being slow to the side for a moment, this seems to be a fundamental issue with EBI's https service (as opposed to http or ftp). If you're getting similar results (I've tried "locally" and at LBL), rather than waiting for an upstream resolution, I'd just as soon switch

https://github.com/geneontology/go-ontology/pull/24202/files https://github.com/geneontology/go-site/pull/1914/files etc.

over to http so we can get things ticking over again and revisit later.

kltm commented 1 year ago

Basically, for the same file, we'll get different download speeds depending on the URL scheme. For example:

https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl ~ 4.5KB/s
http://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl ~ 15.0MB/s
ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl ~ 20.0MB/s

The difference between FTP and HTTP is small and could probably be accounted for in a few different ways, but the HTTPS download is remarkably slower.
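
A minimal sketch of how this per-scheme comparison could be scripted, assuming curl is installed. The function names (scheme_variants, measure_speed, compare_schemes) and the 60-second cap are illustrative choices, not part of the pipeline.

```shell
# Print the https, http, and ftp variants of one URL, one per line.
scheme_variants() {
  rest=${1#*://}            # strip whatever scheme the URL came with
  for scheme in https http ftp; do
    printf '%s://%s\n' "$scheme" "$rest"
  done
}

# Download a URL to /dev/null and print curl's average speed in bytes/second.
# $2 (default 60) caps the probe so a crawling transfer cannot run for hours.
measure_speed() {
  curl --silent --output /dev/null --max-time "${2:-60}" \
       --write-out '%{speed_download}\n' "$1"
}

# Probe all three schemes for the same file and print one line per scheme.
compare_schemes() {
  scheme_variants "$1" | while read -r url; do
    printf '%s\t%s B/s\n' "$url" "$(measure_speed "$url")"
  done
}
```

Something like `compare_schemes https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl` would then give three comparable figures from one site in a few minutes.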

kltm commented 1 year ago

@alexsign opened a ticket on our behalf and soon after the speed issue seems to have resolved for all the previously slow URLs that I've tested. Thank you!

kltm commented 1 year ago

@alexsign Unfortunately, after a short break, the very slow download speeds have started again.

kltm commented 1 year ago

@alexsign Doing some quick testing from a single site:

https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl ~ 424KB/s (~26m)
http://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl ~ 422KB/s (~27m)
ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl ~ 9.0MB/s (~70s)

It currently appears that http and https now have the same (slow) download speeds, while ftp is at least an order of magnitude faster.

alexsign commented 1 year ago

@kltm sorry for delay on this. The HTTPS download speed should improve now. Can you please test it and let me know.

kltm commented 1 year ago

@alexsign Cheers! It looks like we just missed the window with our current build; our next full build cycle will start tomorrow at midnight, so we'll get the results over the weekend. I've made a note to check in on them. Thank you for all of your help on this.

kltm commented 1 year ago

@alexsign Doing some testing locally and seeing the current load status, the numbers are completely consistent with https://github.com/geneontology/pipeline/issues/305#issuecomment-1303903774 ; basically, http and https still seem to be extremely slow and ftp is still relatively very very very fast.

alexsign commented 1 year ago

@kltm I sent your feedback to the EBI infrastructure team. Hopefully, they'll figure it out eventually.

alexsign commented 1 year ago

@kltm EBI infrastructure team made changes again. Please test when convenient.

kltm commented 1 year ago

@alexsign Previous runtime for downloads: 10h 33min; current runtime: 22min 59s. A huge improvement! Thank you for running this back and forth for us.

kltm commented 1 year ago

@alexsign I hate to do this; maybe there is a person I should be contacting directly? Since about the 25th or 26th of November, downloads are slow again.

kltm commented 1 year ago

Looks like things are "fast" again.

kltm commented 1 year ago

@alexsign Whoops--my bad. Checking in on this, it is still an ongoing issue, with downloads for our main pipeline runs taking more than ten (10) hours.

kltm commented 1 year ago

All EBI download resources:

grep -r "ftp.ebi.ac.uk" ~/local/src/git/go-site/metadata/datasets/ | grep -oh "https:.*"

kltm commented 1 year ago

Getting all non-ontology EBI resources:

grep -r "ftp.ebi.ac.uk" ~/local/src/git/go-site/metadata/datasets/ | grep -oh "https:.*" > /tmp/files.txt && sed -e 's/https/ftp/g' /tmp/files.txt > /tmp/files.txt.changed && wget -i /tmp/files.txt.changed
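
As a sketch only, the same pipeline could be wrapped in a small function so it can be reused. The name ebi_urls_as_ftp is hypothetical, and it differs from the one-liner in two ways: the URL match stops at the first whitespace, and the sed rewrite is anchored to the scheme so an "https" later in a URL is left alone.

```shell
# Print every EBI https URL found under directory $1, rewritten to ftp://.
ebi_urls_as_ftp() {
  grep -rh 'ftp\.ebi\.ac\.uk' "$1" \
    | grep -o 'https://[^[:space:]]*' \
    | sed 's|^https://|ftp://|' \
    | sort -u
}
```

The output can be written to a file and passed to `wget -i`, as in the command above.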

pgaudet commented 1 year ago

From @alexsign: a proposal would be to download CHEBI only when there is a release (~monthly).

pgaudet commented 1 year ago

Also, can you download the compressed file?

alexsign commented 1 year ago

@kltm I can suggest two things:

  1. Download http://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz when needed; it's ~46MB right now.
  2. For the next GOA release (in about 1 week's time) I'll create an MD5 file for https://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz, which is currently 16GB and updated every 2 months or so. The MD5 file should be downloaded first and used to check whether a new download is needed.
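
A rough sketch of the check described in point 2, assuming GNU md5sum and wget are available and that the checksum file starts with the MD5 hash (the usual md5sum output format). The function names and the checksum URL passed in are hypothetical; the real MD5 file did not exist yet at the time of this comment.

```shell
# Return 0 (true) when the local copy is missing or its MD5 differs from the
# published one, i.e. when a fresh download is needed.
# $1: local file   $2: checksum file whose first field is the expected MD5
needs_download() {
  [ -f "$1" ] || return 0
  expected=$(awk 'NR == 1 { print $1 }' "$2")
  actual=$(md5sum "$1" | awk '{ print $1 }')
  [ "$expected" != "$actual" ]
}

# Fetch the small checksum file first, then the large file only if required.
# If the checksum itself cannot be fetched, fall through to a full download.
# $1: data URL   $2: checksum URL (hypothetical)   $3: local destination
fetch_if_changed() {
  md5_tmp=$(mktemp) || return 1
  if wget --quiet --output-document="$md5_tmp" "$2" \
     && ! needs_download "$3" "$md5_tmp"; then
    rm -f "$md5_tmp"
    echo "up to date: $3"
    return 0
  fi
  rm -f "$md5_tmp"
  wget --quiet --output-document="$3" "$1"
}
```

For a 16GB file that changes every couple of months, most runs would then transfer only the checksum.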

kltm commented 1 year ago

Just some current stats:

file                                                             throughput  ETA
http://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz   420KB/s     ~100s
http://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl      428KB/s     ~1600s
ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz    9.0MB/s     ~6s
ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl       7.17MB/s    ~90s
https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz  432KB/s     ~100s
https://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl     432KB/s     ~1600s

Noting current stats from local machine. I'm not sure about switching over generally to compressed ontology files--that might end up being a bit fiddly or require us to retool with catalogs (i.e. https://github.com/geneontology/pipeline/issues/27#issuecomment-499615634), but it's worth keeping in mind.

alexsign commented 1 year ago

@kltm I opened another ticket with the EBI infrastructure team. Hopefully, it will be more successful than the one before.

kltm commented 1 year ago

@alexsign No worries here--thank you for that :) The table above was just meant as a recent note on the current state of play from here (not a poke in any way). We appreciate all of your help in mediating this!

kltm commented 1 year ago

@pgaudet We've had the second case in a week where either the EBI server or connection is unstable and we lose a run; like:

00:03:19  Download of https://ftp.ebi.ac.uk/pub/databases/GO/goa/CHICKEN/goa_chicken_isoform.gaf.gz failed: Unable to establish SSL connection.

I think I'm ready to implement a local cache/mirror to support our runs so that we can at least deal with late-stage errors. @cmungall Do you see any problems with this? For the moment it would be a GO-only resource.
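
Independent of a mirror, a transient failure like the one above could be softened with a generic retry wrapper. This is a sketch only: retry is a hypothetical helper, not something the pipeline is known to use, and the attempt count and delay are arbitrary.

```shell
# Run a command up to $1 times, sleeping $2 seconds after the first failure
# and doubling the delay after each further one.
# Usage: retry MAX_ATTEMPTS INITIAL_DELAY command [args...]
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

For example, `retry 5 30 wget --quiet https://ftp.ebi.ac.uk/pub/databases/GO/goa/CHICKEN/goa_chicken_isoform.gaf.gz` would try up to five times before failing the step. This only helps with short outages; it does nothing for sustained slow transfers, which is what the mirror is meant to address.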

kltm commented 1 year ago

Noting that grabbing all EBI files, using FTP, takes:

mkdir -p /tmp/foo && cd /tmp/foo && grep -r "ftp.ebi.ac.uk" ~/local/src/git/go-site/metadata/datasets/ | grep -oh "https:.*" > /tmp/files.txt && sed -e 's/https/ftp/g' /tmp/files.txt > /tmp/files.txt.changed && time wget -i /tmp/files.txt.changed

Total wall clock time: 39m 53s
Downloaded: 83 files, 38G in 37m 32s (17.2 MB/s)

Which is a win over the 12+ hours we're currently at.

The upload from local was real 50m26.029s.

time s3cmd -c ~/SECRET_CRED --mime-type=text/plain put /tmp/foo/* s3://go-mirror

So let's say one and a half hours per mirror attempt. Pulling from our self-made mirror, we have:

Total wall clock time: 13m 29s
Downloaded: 83 files, 38G in 13m 12s (48.8 MB/s)

This means that we're doing 3x speed over the old EBI FTP and likely have better stability. I'd propose that we either 1) switch back to the old EBI FTP or 2) start the process of running off our own mirror. Tagging @cmungall @pgaudet .
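
The rounded figures above can be cross-checked with a little shell arithmetic. This sketch uses only the wall-clock times quoted in this comment; the fractional second of the upload time is dropped.

```shell
# Convert minutes and seconds to seconds.
secs() { echo $(( $1 * 60 + $2 )); }

ebi_ftp=$(secs 39 53)   # wget wall clock pulling from EBI over FTP
upload=$(secs 50 26)    # s3cmd upload to the mirror (real 50m26.029s)
mirror=$(secs 13 29)    # wget wall clock pulling from the mirror

# One mirror refresh = download from EBI + upload to the mirror.
refresh=$((ebi_ftp + upload))
printf 'mirror refresh: %dm %ds\n' $((refresh / 60)) $((refresh % 60))
# prints "mirror refresh: 90m 19s"

# Wall-clock speedup of the mirror over EBI FTP.
awk -v a="$ebi_ftp" -v b="$mirror" 'BEGIN { printf "speedup: %.2fx\n", a / b }'
# prints "speedup: 2.96x"
```

So "one and a half hours per mirror attempt" and "3x" both hold; by the reported throughput (48.8 MB/s against 17.2 MB/s) the ratio is slightly lower, about 2.8x.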

kltm commented 1 year ago

I'd also note that we could front this with LBL Cloudflare for free, which might be a fun experiment :)

kltm commented 1 year ago

Currently looking at using pipeline branch goa-copy-to-mirror to interact with mirror.geneontology.io.

kltm commented 1 year ago

Currently running a full mirror build test.

kltm commented 1 year ago

@dustine32 I'm considering a non-breaking schema change for datasets.schema.yaml in the GO metadata, adding mirror_of. Essentially, I'd like to be able to refer to where we are directly getting the data for the pipeline (source) and what we are mirroring (mirror_of) in the same location. Any thoughts?

dustine32 commented 1 year ago

@kltm Yeah, I think I'm a fan of mirror_of. Is this just an extra field initially for information only or does other code need to be changed to use it?

kltm commented 1 year ago

This would be an additional field. Only new code--namely my mirroring code--would need to "see" it, and even then only if they cared about mirrors. Everything else could safely ignore it.
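
For illustration only, here is how mirroring code might collect the upstream URLs recorded in mirror_of. The layout assumed here (one unquoted `mirror_of: <url>` per line in the dataset YAML files) is a guess, since the final schema is not shown in this thread; list_mirror_upstreams is a hypothetical name, and the line-based matching is a stand-in for a real YAML parser.

```shell
# List the distinct upstream URLs named by mirror_of fields.
# $1: directory holding the dataset YAML files
list_mirror_upstreams() {
  grep -rh -E '^[[:space:]]*mirror_of:' "$1" \
    | sed -E 's/^[[:space:]]*mirror_of:[[:space:]]*//; s/[[:space:]]+$//' \
    | sort -u
}
```

Code that does not care about mirrors never looks for the field, which is what makes the change non-breaking.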

dustine32 commented 1 year ago

Ah, yeah, that sounds good to me!

kltm commented 1 year ago

Testing at https://build.geneontology.org/job/geneontology/job/pipeline/job/master/

kltm commented 1 year ago

Now testing on main pipeline branches.

kltm commented 1 year ago

Moving to clearing for testing.

kltm commented 1 year ago

Our download phase has gone from 11+h to 15+m. No oddities yet.