COG-UK / dipi-group

Data integrity and pipeline integration working group

fasta in https://cog-uk.s3.climb.ac.uk/phylogenetics/latest/ -- xz compression? #180

Closed AngieHinrichs closed 2 years ago

AngieHinrichs commented 2 years ago

Recently my fetches of https://cog-uk.s3.climb.ac.uk/phylogenetics/latest/cog_all.fasta have been repeatedly failing. Even with multiple "curl -C -" commands to continue at the offset where the previous fetch failed, sometimes 5 attempts is not enough to get the whole file.
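For anyone hitting the same failures, the repeated "curl -C -" approach can be automated. The following is a minimal sketch in Python, not something the COG-UK maintainers provide: it assumes the server honours HTTP `Range` requests (which the successful "curl -C -" resumes suggest), and the function names `range_header`, `expected_total`, and `fetch_with_resume` are hypothetical helpers made up for this illustration.

```python
import http.client
import os
import time
import urllib.error
import urllib.request


def range_header(offset):
    """Header asking the server to resume at byte `offset` (empty for a fresh start)."""
    return {"Range": f"bytes={offset}-"} if offset > 0 else {}


def expected_total(status, headers):
    """Full file size implied by the response headers, or None if unknown."""
    if status == 206:
        # Content-Range looks like "bytes 1000-53999/54000"; the total follows the slash.
        total = headers.get("Content-Range", "").rpartition("/")[2]
        return int(total) if total.isdigit() else None
    length = headers.get("Content-Length")
    return int(length) if length and length.isdigit() else None


def fetch_with_resume(url, dest, max_attempts=10, chunk=1 << 20):
    """Download `url` to `dest`, resuming from the bytes already on disk after each failure."""
    for attempt in range(1, max_attempts + 1):
        offset = os.path.getsize(dest) if os.path.exists(dest) else 0
        request = urllib.request.Request(url, headers=range_header(offset))
        try:
            with urllib.request.urlopen(request, timeout=60) as resp:
                total = expected_total(resp.status, resp.headers)
                # 206: the server resumed at our offset, so append.
                # 200: the server is sending the whole file, so start over.
                with open(dest, "ab" if resp.status == 206 else "wb") as out:
                    while block := resp.read(chunk):
                        out.write(block)
            # A dropped connection can look like a normal end of stream,
            # so compare the size on disk with the advertised total when known.
            if total is None or os.path.getsize(dest) >= total:
                return
        except urllib.error.HTTPError as err:
            if err.code == 416:  # range starts at or past the end: nothing left to fetch
                return
            print(f"attempt {attempt}: HTTP {err.code}")
        except (OSError, http.client.HTTPException) as err:
            print(f"attempt {attempt}: {err}")
        time.sleep(min(2 ** attempt, 60))  # simple capped back-off before retrying
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")
```

Calling `fetch_with_resume("https://cog-uk.s3.climb.ac.uk/phylogenetics/latest/cog_all.fasta", "cog_all.fasta")` would then keep retrying from the current offset. If the server reports no total size, the sketch cannot tell a truncated transfer from a complete one, so a checksum comparison would still be worthwhile where one is published.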

The uncompressed cog_all.fasta is >54GB now (!), but when compressed with xz (which can run multi-threaded), it's much smaller, ~111MB.

Would it be possible to xz-compress cog_all.fasta (and possibly other download files as well)? I hope that would help the network transfers, and if you're using cloud storage, should save costs there as well.

rmcolq commented 2 years ago

I think compressed versions of all the files except the newick (which is small anyway) should be available at the same file paths with .gz on the end. However, I was told not to remove the uncompressed ones, to avoid breaking people's existing pipelines.

AngieHinrichs commented 2 years ago

Oh, well, that's kind of embarrassing, I should have just tried cog_all.fasta.gz! 😆 https://cog-uk.s3.climb.ac.uk/phylogenetics/latest/ doesn't offer a listing but, um, yeah. Looks like cog_all.fasta.gz is 7GB -- better than 54GB, but still, could be 0.1GB with xz. :) Thanks!
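The gap between gzip and xz on this kind of data can be illustrated with a small, purely synthetic experiment. The sketch below is an assumption-laden toy, not a measurement of cog_all.fasta: it builds 100 made-up FASTA records that are near-identical copies of one random 29,903-base sequence (the record names, mutation count, and `mutate` helper are all invented for the example) and compresses them with Python's standard `gzip` and `lzma` modules. Absolute ratios on the real file will differ.

```python
import gzip
import lzma
import random

rng = random.Random(0)  # fixed seed so the toy data set is reproducible

# One random "reference" sequence; real genomes are not random, this is only a stand-in.
reference = "".join(rng.choice("ACGT") for _ in range(29_903))


def mutate(seq, n_changes):
    """Return a copy of `seq` with `n_changes` positions overwritten by random bases."""
    bases = list(seq)
    for pos in rng.sample(range(len(bases)), n_changes):
        bases[pos] = rng.choice("ACGT")
    return "".join(bases)


# 100 near-identical records, mimicking many closely related genomes in one FASTA file.
records = [f">sample_{i}\n{mutate(reference, 10)}\n" for i in range(100)]
raw = "".join(records).encode("ascii")

gz = gzip.compress(raw, compresslevel=9)
xz = lzma.compress(raw, preset=6)  # writes the .xz container format by default

print(f"raw:  {len(raw):>9,} bytes")
print(f"gzip: {len(gz):>9,} bytes")
print(f"xz:   {len(xz):>9,} bytes")
```

Note that Python's `lzma` module compresses on a single thread; the multi-threaded mode mentioned above belongs to the `xz` command-line tool.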

SamStudio8 commented 2 years ago

@AngieHinrichs, you might find https://data.covid19.climb.ac.uk/changelog a useful log to check in on occasionally as we try our best to put up notifications of any changes to the data set and our processes -- like when compressed outputs were added.

AngieHinrichs commented 2 years ago

Noted, thanks @SamStudio8. It would be great to get https://www.cogconsortium.uk/tools-analysis/public-data-analysis-2/ updated to point to the compressed versions too -- I will email the address on that page, contact@cogconsortium.uk.