bokulich-lab / q2-fondue

Functions for reproducibly Obtaining and Normalizing Data re-Used from Elsewhere
BSD 3-Clause "New" or "Revised" License

Download of large samples is very slow #93

Closed misialq closed 2 years ago

misialq commented 2 years ago

When downloading sequences for some large samples (e.g., metagenomes), the download is very slow compared to smaller datasets.

Steps to reproduce: Try to fetch sequences for ID ERR1700893 and observe the time it takes.

Expected behaviour: The size of this dataset is approx. 28 GB - it should be a matter of half an hour to an hour to fetch (depends on connection speed).

Actual behaviour: It takes hours (I don't know exactly how long, as I didn't wait for it to finish).

The problem is that for large datasets prefetch silently fails, because its default maximum download size is 20 GB. fasterq-dump then takes over and fetches the data itself, but it is much slower. This can easily be fixed by setting the max-size parameter of prefetch to unlimited, which allows downloads of any size. See here for some more info.
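A minimal sketch of the proposed fix, not the actual q2-fondue code: it builds a prefetch call with the size limit lifted and runs it so that a non-zero exit status raises an error. The helper names `build_prefetch_cmd` and `run_prefetch` are hypothetical, and passing `u` to `--max-size` to mean "unlimited" is an assumption about the installed sra-tools version; if your version does not accept it, an explicit value such as `100G` should work instead.

```python
import subprocess


def build_prefetch_cmd(accession, output_dir, max_size="u"):
    """Build a prefetch command with an explicit size limit.

    max_size="u" is assumed to mean "unlimited" (depends on the
    sra-tools version); pass e.g. "100G" to set an explicit cap.
    """
    return [
        "prefetch",
        "--max-size", max_size,
        "--output-directory", output_dir,
        accession,
    ]


def run_prefetch(accession, output_dir, max_size="u"):
    """Run prefetch; check=True raises CalledProcessError on a
    non-zero exit status, so such failures are not silent."""
    cmd = build_prefetch_cmd(accession, output_dir, max_size)
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

For the accession from this report, `run_prefetch("ERR1700893", "sra_cache")` would download the run into a (hypothetical) `sra_cache` directory, from which fasterq-dump can then extract the reads locally.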