Hey @centaria , I would go with `aws s3 cp` if you can get the bucket locations for the `.sra` files for all 10k ahead of time. This was originally implemented before the whole SRA-bucket-location situation was sorted out; things would move around much more often back in 2020, so we went with `prefetch` and then `fastq-dump`.
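If it helps, here is roughly what that could look like. A minimal sketch, not how Serratus does it: `accessions.txt` and the `s3://sra-pub-run-odp/sra/<ACC>/<ACC>` layout of the SRA Open Data bucket are assumptions, so verify each location (e.g. with `srapath <ACC>`) before depending on it.

```bash
#!/usr/bin/env bash
# Minimal sketch: copy .sra objects directly from S3 once their locations are known.
# ASSUMPTIONS: accessions.txt (one accession per line) and the
# s3://sra-pub-run-odp/sra/<ACC>/<ACC> layout are illustrative, not confirmed
# by this thread; check each location first (e.g. with `srapath <ACC>`).
set -euo pipefail
while read -r ACC; do
  aws s3 cp "s3://sra-pub-run-odp/sra/${ACC}/${ACC}" "./${ACC}.sra"
done < accessions.txt
```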
My recommendation is NOT to use `fasterq-dump` or any sort of parallelization. While the wall-clock speed of using multiple cores is indeed faster for a single library, the efficiency of download/decompression is MUCH worse. The "optimal" solution for 10K SRA accessions is to run 10K single threads of `prefetch` and then `fastq-dump` after the `.sra` file is on disk. This is assuming you're going cloud-native; if you're on-prem, the equation may change depending on how fast the networking is. If you provide more details on what you're doing (bioinformatics-tool wise), I can offer an opinion.
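For illustration, the single-threaded pattern described above would look something like this per worker; `accessions.txt` and the output layout are my own placeholders, not part of this thread.

```bash
#!/usr/bin/env bash
# Minimal sketch of one single-threaded worker: prefetch, then fastq-dump,
# one accession at a time, with no intra-library parallelism.
# ASSUMPTION: accessions.txt holds this worker's share of the 10K accessions.
set -euo pipefail
while read -r ACC; do
  prefetch "$ACC"                   # download the .sra file to local disk
  fastq-dump --split-files "$ACC"   # decompress to FASTQ only after it is local
done < accessions.txt
```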
Hi Artem, thanks so much for the rapid response! After the download I'm planning to run MEGAHIT/SPAdes. All in the cloud, not on-prem. If I go with `aws s3 cp`, where should I look for the SRA/FASTQ bucket locations in a systematic way? If I go with `prefetch` and then `fastq-dump`, any suggestions on memory optimizations? I noticed many crashes (I think) on the larger FASTQ files. Thank you very much for your help! :)
If you're going to run SPAdes, there is an experimental version that can consume SRA files directly, without conversion.
What's the documentation for SPAdes running directly on SRA files, Anton? Is there base-quality trimming?
@centaria Memory should not be an issue; you may run into disk-space limits. What we did was randomize the input list of SRA files so that you get an averaged workload on disk (also use something like 100-200 GB of disk space per node). When decompressing, it helps to write directly to an S3 bucket (or at least that's what we do), which avoids writes while the disk is reading. This also avoids having to keep a massive FASTQ file on disk.
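As a rough sketch of that setup (the bucket name, `accessions.txt`, and the gzip step are placeholders of mine, not the Serratus configuration):

```bash
#!/usr/bin/env bash
# Minimal sketch: shuffled work list, decompression streamed straight to S3
# so no full FASTQ ever sits on local disk.
# ASSUMPTIONS: accessions.txt and s3://my-output-bucket/ are placeholders;
# `fastq-dump -Z` writes FASTQ to stdout and `aws s3 cp -` reads from stdin.
set -euo pipefail
shuf accessions.txt > shuffled.txt   # randomize to average out the disk workload
while read -r ACC; do
  prefetch "$ACC"
  # Single stream for brevity; paired-end splitting is omitted in this sketch.
  fastq-dump -Z "$ACC" | gzip | aws s3 cp - "s3://my-output-bucket/${ACC}.fastq.gz"
  rm -rf "${ACC}" "${ACC}.sra"       # free scratch space (exact path depends on prefetch config)
done < shuffled.txt
```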
Once you have all the decompressed FASTQ files in your own bucket, you can download/stream them directly into your application. For SPAdes you will need them on disk, and they will be indexed, etc. As Anton said, I'd try to use SPAdes directly on the `.sra` file; all you would have to do then is `prefetch` and run SPAdes against your `~/ncbi/prefetch/` directory (where the `.sra` files get downloaded to).
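Something like the following, as a sketch: the accession is hypothetical, and since the SRA-reading SPAdes build is undocumented, the exact `spades.py` option is left out rather than guessed.

```bash
#!/usr/bin/env bash
# Minimal sketch: prefetch one accession, then locate the .sra file that the
# experimental SPAdes build would be pointed at.
# ASSUMPTIONS: SRR000001 is a hypothetical accession; ~/ncbi/prefetch/ is the
# download directory mentioned above (it varies with vdb-config settings); the
# SPAdes option for .sra input is not documented, so it is not shown here.
set -euo pipefail
ACC=SRR000001
prefetch "$ACC"
SRA_PATH=$(find ~/ncbi/prefetch/ -name "${ACC}.sra" | head -n 1)
echo "Pass this path to the experimental SPAdes build: ${SRA_PATH}"
```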
Thank you, Artem, this is very helpful! I'll try what you suggested. Thank you :)
> What's the documentation for SPAdes running directly on SRA files, Anton? Is there base-quality trimming?
So far there is no documentation :) SRA files are assumed to "just" work. There is no base-quality trimming, though it could be implemented (it would slow things down, as currently we're ignoring the qualities altogether).
This is now much, much slower than S3, so I will disregard it.
Originally posted by @ababaian in https://github.com/ababaian/serratus/issues/12#issuecomment-616959147
Hi Artem and the Serratus team, I want to run some customizations on Serratus but have encountered an SRA download issue. I see that you discussed using `fasterq-dump`, parallelization, and `aws s3 cp`, but eventually went with `fastq-dump`. Would you recommend using `fasterq-dump` over `aws s3 cp`? I need to get about 10k SRA accessions and wanted to parallelize to speed things up. Thank you.