ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Speed up SRA download: aws s3 cp or sratoolkit? #262

Closed: centaria closed this issue 2 years ago

centaria commented 2 years ago

This is now much, much slower than S3 and will be disregarded.

Originally posted by @ababaian in https://github.com/ababaian/serratus/issues/12#issuecomment-616959147

Hi Artem and the Serratus team, I want to run some customizations on Serratus but have run into an SRA download issue. I see that you discussed using fasterq-dump, parallelization, and aws s3 cp, but eventually went with fastq-dump. Would you recommend fasterq-dump over aws s3 cp? I need to download about 10k SRA accessions and want to parallelize to save time. Thank you.

ababaian commented 2 years ago

Hey @centaria, I would go with aws s3 cp if you can get the bucket locations of the .sra files for all 10k accessions ahead of time. Serratus was originally implemented before the SRA bucket locations were sorted out; files would move around much more often back in 2020, so we went with prefetch followed by fastq-dump.
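A minimal sketch of the bulk-copy route, assuming the SRA Open Data Program bucket (s3://sra-pub-run-odp) and its sra/<ACC>/<ACC> object layout; neither is specified in this thread, and srapath from sra-tools can be used to resolve locations if the layout differs:

```bash
# Sketch: copy the .sra object for each accession from the public
# SRA Open Data bucket (assumed layout: s3://sra-pub-run-odp/sra/<ACC>/<ACC>).
mkdir -p sra
while read -r acc; do
  aws s3 cp --no-sign-request \
    "s3://sra-pub-run-odp/sra/${acc}/${acc}" "sra/${acc}.sra"
done < accessions.txt
```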

My recommendation is NOT to use fasterq-dump or any per-library parallelization: while the wall-clock speed of using multiple cores is indeed faster for a single library, the efficiency of download/decompression is MUCH worse. The "optimal" solution for 10k SRA accessions is to run 10k single-threaded jobs of prefetch followed by fastq-dump once the .sra file is on disk. This assumes you're going cloud-native; if you're on-prem, the calculus changes depending on how fast your networking is. If you provide more details on what you're doing (bioinformatic-tools-wise), I can offer an opinion.
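A minimal sketch of that single-thread-per-accession pattern, assuming sra-tools is installed, accessions.txt lists one accession per line, and GNU xargs provides -P for process-level parallelism (the concurrency of 16 is an arbitrary example):

```bash
# Sketch: one single-threaded prefetch + fastq-dump job per accession,
# parallelized across accessions only (16 jobs at a time here).
mkdir -p sra fastq
fetch_one() {
  acc="$1"
  prefetch "$acc" --output-directory sra &&
  fastq-dump --split-3 --gzip --outdir fastq "sra/${acc}/${acc}.sra"
}
export -f fetch_one
xargs -a accessions.txt -n1 -P16 -I{} bash -c 'fetch_one "$1"' _ {}
```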

centaria commented 2 years ago

Hi Artem, thanks so much for the rapid response! After the download I'm planning to run MEGAHIT/SPAdes, all in the cloud, not on-prem. If I go with aws s3 cp, where can I look up the SRA fastq locations in a systematic way? If I do prefetch and then fastq-dump, any suggestions on memory optimization? I noticed many crashes (I think) on larger fastq files. Thank you very much for your help! :)

asl commented 2 years ago

If you're going to run SPAdes, there is an experimental version that can consume SRA files directly, without conversion.

ababaian commented 2 years ago

What's the documentation for running SPAdes directly on SRA files, Anton? Is there base-quality trimming?

ababaian commented 2 years ago

@centaria Memory should not be an issue; you may run into disk-space limits. What we did was randomize the input list of SRA files so that each node gets an averaged workload on disk (also use something like 100-200 GB of disk space per node). When decompressing, it helps to write directly to an S3 bucket (or at least that's what we do), which avoids writing while the disk is reading and avoids having to keep a massive fastq file on disk.
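A minimal sketch of the stream-to-S3 idea, assuming a placeholder bucket name (my-work-bucket) and that interleaved output from fastq-dump -Z is acceptable downstream:

```bash
# Sketch: randomize the accession list for an averaged per-node workload,
# then stream decompressed reads straight to S3 instead of keeping a large
# fastq file on local disk (bucket name is a placeholder).
shuf accessions.txt > accessions.shuffled.txt
acc=SRR000001   # example accession
prefetch "$acc" --output-directory sra
fastq-dump -Z --split-spot "sra/${acc}/${acc}.sra" \
  | gzip \
  | aws s3 cp - "s3://my-work-bucket/fastq/${acc}.fastq.gz"
```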

Once you have all the decompressed fastq files in your own bucket, you can download/stream them directly into your application. For SPAdes you will need them on disk, and they will be indexed, etc. As Anton said, I'd try to use SPAdes directly on the .sra file; all you would have to do then is prefetch and run SPAdes against your ~/ncbi/prefetch/ directory (where the .sra files get downloaded).
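Since the direct-.sra SPAdes input mentioned here is experimental and its invocation is not documented in this thread, here is a hedged sketch of the conventional per-accession route instead (prefetch, convert, then assemble with metaSPAdes); directory names and the accession are placeholders:

```bash
# Sketch: conventional convert-then-assemble route for one accession
# (the experimental direct-.sra SPAdes input is not shown because its
# flags are undocumented at the time of this thread).
acc=SRR000001   # example accession
prefetch "$acc" --output-directory sra
fastq-dump --split-3 --outdir fastq "sra/${acc}/${acc}.sra"
spades.py --meta \
  -1 "fastq/${acc}_1.fastq" -2 "fastq/${acc}_2.fastq" \
  -o "asm/${acc}"
```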

centaria commented 2 years ago

Thank you, Artem, this is very helpful! I'll try what you suggested. Thank you! :)

asl commented 2 years ago

What's the documentation for running SPAdes directly on SRA files, Anton? Is there base-quality trimming?

So far there is no documentation :) SRA files are assumed to "just" work. There is no base-quality trimming, though it could be implemented (it would slow things down, as currently we ignore the qualities altogether).