@jefftaylor42 @ababaian Please check whether the issue title is as clear, appropriate, and descriptive as possible, since I didn't fully understand the context here. Thanks.
From the `fusera` documentation:

> To gain access to data from such a controlled access study, users would submit a Data Access Request (DAR) for their research project.

Do we have a DAR in place?

Also, is it somehow related to https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud/?
I think we should consider replacing `fastq-dump` with https://github.com/rvalieris/parallel-fastq-dump.
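For reference, an invocation would look roughly like this (the accession is arbitrary, and the flags are assumed to mirror `fastq-dump`'s, with `--threads` controlling how many blocks are dumped at once):

```bash
parallel-fastq-dump --sra-id SRR000001 --threads 4 \
  --outdir out/ --split-files --gzip
```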
None of the data we're accessing is under a DAR/dbGaP, so no, we don't need one (yet).
I believe there is `fasterq-dump`, a parallel version of `fastq-dump`, which is in `sratoolkit` as of February this year. Jeff has worked on optimizing this. He spent 3-4 days optimizing the download functions, and the current setup is that each downloader node, called `serratus-dl`, runs multiple `fastq-dump` operations in parallel. This opens up the networking capacity substantially and helps with more stable CPU usage.
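As a rough illustration of that setup (not the actual `serratus-dl` code; `accessions.txt` is a placeholder list of run IDs, one per line):

```bash
# Run up to 4 fastq-dump jobs concurrently over a list of accessions.
xargs -P 4 -I {} fastq-dump --split-files --gzip -O out/ {} < accessions.txt
```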
The main 'speed up' from possibly using `fusera` is that if a non-binary `.fq.gz` happens to exist, we can download and process that directly rather than going through decompression from the `.sra` format. I am not sure how many accessions actually have this, though, so it's kind of a 'theoretical' issue at the moment.
Of course, if you can show it works with higher efficiency/speed, then let's go with `parallel-fastq-dump` :+1:
What we're doing right now is processing individual accessions serially, by streaming `fastq-dump` through named pipes to AWS (kind of like this project), and then running N instances of the whole process concurrently. It saturates a CPU about 90% of the time, which is OK, but can definitely be improved.
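A minimal sketch of the named-pipe idea, with a placeholder bucket name (`fastq-dump -Z` writes the FASTQ to stdout, and `aws s3 cp - <s3-uri>` uploads from stdin):

```bash
ACC=SRR000001
mkfifo "${ACC}.pipe"
fastq-dump -Z "$ACC" > "${ACC}.pipe" &        # dump FASTQ into the pipe
aws s3 cp - "s3://example-bucket/${ACC}.fastq" < "${ACC}.pipe"
wait
rm "${ACC}.pipe"
```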
I've skimmed the source of `parallel-fastq-dump`, which uses the block options of `fastq-dump` to speed up individual dumps. I think that's actually an improvement over what we're doing right now. But I haven't profiled it, so I'm not sure how efficient it is, and it uses `seek` heavily, so it wouldn't be a drop-in replacement.
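The block trick it relies on can be sketched like this (the spot count is hard-coded for illustration; in practice it would come from the run's metadata, e.g. via `sra-stat`):

```bash
# Dump a run in concurrent spot-range chunks using fastq-dump's -N/-X.
ACC=SRR000001
TOTAL=1000000      # assumed spot count, illustration only
CHUNK=250000
for START in $(seq 1 "$CHUNK" "$TOTAL"); do
  END=$((START + CHUNK - 1))
  fastq-dump -N "$START" -X "$END" --split-files \
    -O "out/${ACC}_${START}" "$ACC" &
done
wait
```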
Artem had a look at `fasterq-dump`, and it was showing worse-than-linear growth, which is OK in some applications, but not in ours. (e.g. If you want a single accession, 50% faster at 4x the CPU cost is an acceptable tradeoff. If you want 100 accessions, it's not.)
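For reference, the kind of call being compared (thread count and paths are arbitrary; `-e` sets the thread count and `-t` the scratch directory):

```bash
fasterq-dump SRR000001 -e 8 -t /tmp -O out/
```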
If SRA also contains aligned data, what happens to the unaligned sequences? In our case we are looking for viral genomes, but they can be dropped during alignment or by an in silico decontamination protocol. @superbsky will check the metadata/samples to verify this.
This turned out to be much, much slower than S3, so we will disregard it.
Hi Artem and Serratus team, I want to run some customizations on Serratus but have encountered an SRA download issue. I see that you discussed using `fasterq-dump`, parallelization, and `aws s3 cp`, but eventually went with `fastq-dump`. Would you recommend using `fasterq-dump` over `aws s3 cp`? I need to get about 10k SRA accessions and want to parallelize to save time. Thank you.
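One possible way to parallelize direct downloads with `aws s3 cp`, assuming the runs are public and mirrored in the SRA Open Data Program bucket (the `sra-pub-run-odp` layout is an assumption on my part, and what you get is still an `.sra` archive that needs `fastq-dump`/`fasterq-dump` afterwards; `accessions.txt` is a placeholder):

```bash
mkdir -p sra
# Fetch up to 8 runs at a time from the assumed public SRA bucket.
xargs -P 8 -I {} \
  aws s3 cp "s3://sra-pub-run-odp/sra/{}/{}" "sra/{}.sra" --no-sign-request \
  < accessions.txt
```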
> `sratoolkit` works natively with AWS/GCP to access `SRA` archive files. Most data is stored in this format and requires `fastq-dump` from `sratoolkit` to start the pipeline. Some modern SRA entries already contain `bam` or `fastq` files; these would be faster to access via `fusera`.
>
> Add a "Try Fusera" option to access `fq` files first, then fall back on the `sratoolkit` fastq dump, which is the current default.

_Originally posted by @ababaian in https://github.com/ababaian/serratus/issues/5#issue-589616145_
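A rough sketch of that "Try Fusera, then fall back" idea, where `FUSERA_MOUNT` is a hypothetical path at which a fusera mount would expose the accession's files (the layout shown is an assumption, not fusera's documented behaviour):

```bash
ACC=SRR000001
if compgen -G "${FUSERA_MOUNT}/${ACC}/*.fastq*" > /dev/null; then
  # Non-binary reads already exist: copy them directly.
  cp "${FUSERA_MOUNT}/${ACC}/"*.fastq* out/
else
  # Fall back on the current default: dump from the .sra archive.
  fastq-dump --split-files --gzip -O out/ "$ACC"
fi
```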