ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

`bam` and `fastq` files via `fusera` #12

Closed by mathemage 4 years ago

mathemage commented 4 years ago

sratoolkit works natively with AWS/GCP to access SRA archive files. Most data is stored in this SRA format and requires fastq-dump from sratoolkit to start the pipeline. Some modern SRA entries already contain bam or fastq files; these would be faster to access via fusera.

Add a "Try Fusera" step to access fastq files first, then fall back on the sratoolkit fastq-dump path, which is the current default.

Originally posted by @ababaian in https://github.com/ababaian/serratus/issues/5#issue-589616145
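The fallback logic proposed above can be sketched as a small shell helper. This is only an illustration of the control flow: `has_direct_fastq` is a hypothetical probe (in practice it would check a fusera mount for a submitter-provided `.fastq.gz`) and is stubbed out here so the sketch runs without fusera or sratoolkit installed.

```shell
#!/bin/sh
# Sketch of the "try fusera first, fall back on fastq-dump" step.
# has_direct_fastq is a stub; a real probe would inspect a fusera mount.
has_direct_fastq() {
  false  # stub: assume no direct fastq is exposed for this accession
}

download_reads() {
  acc="$1"
  if has_direct_fastq "$acc"; then
    echo "direct:$acc"    # would copy $acc/*.fastq.gz off the fusera mount
  else
    echo "sra-dump:$acc"  # would run: fastq-dump --split-3 "$acc"
  fi
}

download_reads SRR5447167
```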

mathemage commented 4 years ago

@jefftaylor42 @ababaian Please check that the issue title is as clear, appropriate, and descriptive as possible, as I didn't understand the context of this. Thanks.

superbsky commented 4 years ago

From the documentation of fusera:

To gain access to data from such a controlled access study, users would submit a Data Access Request (DAR) for their research project.

Do we have a DAR in place?

Also, is this somehow related to https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud/?

I think we should consider replacing fastq-dump with https://github.com/rvalieris/parallel-fastq-dump

ababaian commented 4 years ago

None of the data we're accessing is via DAR/dbGaP, so no, we don't need one (yet).

I believe there is fasterq-dump, a parallel version of fastq-dump, which has been in sratoolkit since February of this year. Jeff has worked on optimizing this. He spent 3-4 days optimizing the downloading functions, and the current system is that each downloader node (serratus-dl) runs multiple parallel fastq-dump operations. This opens up the networking capacity substantially and helps with more stable CPU usage.
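The "multiple parallel fastq-dump operations per downloader node" pattern can be sketched with `xargs -P`. For the sake of a runnable example, a plain `echo` stands in for the real `fastq-dump` invocation; the accession list and worker count are illustrative, not taken from serratus-dl.

```shell
#!/bin/sh
# Sketch: fan N accessions out to 4 parallel dump jobs on one node.
# `echo "dumped $0"` stands in for: fastq-dump --split-3 "$0"
printf '%s\n' SRR001 SRR002 SRR003 SRR004 \
  | xargs -n1 -P4 sh -c 'echo "dumped $0"' \
  | sort
```

`sort` is only there to make the (otherwise nondeterministically ordered) parallel output stable.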

The main 'speed-up' from possibly using fusera is that, if a non-binary .fq.gz happens to exist, we can download and process it directly instead of decompressing from the sra format. I am not sure how many accessions actually have this, though, so it's a somewhat 'theoretical' issue at the moment.

Of course, if you can show it works with higher efficiency/speed, then let's go with parallel-fastq-dump :+1:

brietaylor commented 4 years ago

What we're doing right now is processing individual accessions serially, by streaming fastq-dump through named pipes to AWS (kind of like this project), and then running N instances of the whole process concurrently. It saturates a CPU about 90% of the time, which is OK, but can definitely be improved.
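The named-pipe streaming described above can be sketched as follows. To keep the sketch runnable without sratoolkit or AWS credentials, a `printf` of a single fastq record stands in for `fastq-dump --stdout`, and `gzip` to a local file stands in for `aws s3 cp - s3://...`; the shape of the pipeline (producer writes to a fifo, consumer streams from it) is the point.

```shell
#!/bin/sh
# Sketch of streaming a dump through a named pipe to an uploader.
tmp=$(mktemp -d)
mkfifo "$tmp/reads.fq"

# Producer (stand-in for: fastq-dump --stdout "$ACC" > reads.fq)
printf '@r1\nACGT\n+\nIIII\n' > "$tmp/reads.fq" &

# Consumer (stand-in for: aws s3 cp "$tmp/reads.fq" s3://bucket/key)
gzip -c < "$tmp/reads.fq" > "$tmp/reads.fq.gz"
wait

gzip -dc "$tmp/reads.fq.gz" | head -1   # first line of the streamed record
rm -rf "$tmp"
```

Nothing is ever staged fully on disk: the consumer reads the producer's output as it is written, which is what lets N of these run concurrently per node.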

I've skimmed the source of parallel-fastq-dump, which uses the block options of fastq-dump to speed up individual dumps. I think that's actually an improvement over what we're doing right now. But I haven't profiled it, so I'm not sure how efficient it is, and it uses seek heavily, so it wouldn't be a drop-in replacement.
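The "block options" referred to are fastq-dump's `-N` (minimum spot id) and `-X` (maximum spot id) flags, which parallel-fastq-dump uses to split one accession into independently dumpable ranges. A sketch of that range-splitting arithmetic (the commands are only printed, not executed, so no sratoolkit is needed; the totals are illustrative):

```shell
#!/bin/sh
# Sketch: split a spot range into per-worker -N/-X chunks,
# as parallel-fastq-dump does, and print the resulting commands.
total=1000                                  # total spots (illustrative)
workers=4
chunk=$(( (total + workers - 1) / workers )) # ceiling division

i=0
while [ $i -lt $workers ]; do
  start=$(( i * chunk + 1 ))
  end=$(( start + chunk - 1 ))
  [ $end -gt $total ] && end=$total
  echo "fastq-dump -N $start -X $end ACC"
  i=$(( i + 1 ))
done
```

Each printed command could then be run concurrently and the outputs concatenated, which is also why the heavy seeking happens: every worker must seek to its starting spot.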

Artem had a look at fasterq-dump, and it was showing worse-than-linear growth in resource use, which is OK in some applications, but not in ours. (e.g., if you want a single accession, 50% faster at 4x the CPU cost is an acceptable tradeoff; if you want 100 accessions, it's not.)
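The tradeoff in that parenthetical can be made concrete with a toy cost calculation. The figures (1 CPU-hour baseline per accession, 4x CPU for fasterq-dump) are the illustrative numbers from the comment, not measurements:

```shell
#!/bin/sh
# Toy fleet-cost comparison using the comment's illustrative numbers.
accessions=100
base_cpu_h=1                                 # assumed CPU-hours per accession
plain=$(( accessions * base_cpu_h ))         # fastq-dump total
faster=$(( accessions * base_cpu_h * 4 ))    # fasterq-dump at 4x CPU
echo "fastq-dump:   $plain CPU-hours"
echo "fasterq-dump: $faster CPU-hours"
```

For one accession the extra CPU buys wall-clock time; across the whole fleet the same multiplier just quadruples the bill.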

superbsky commented 4 years ago

If an SRA entry also contains aligned data, what happens to the unaligned sequences? In our case, we are looking for viral genomes, but they can be dropped during alignment or by an in silico decontamination protocol. @superbsky will check metadata/samples to verify this.

ababaian commented 4 years ago

This is now much, much slower than S3, so we will disregard it.

centaria commented 2 years ago

Hi Artem and Serratus team, I want to run some customization on Serratus but have encountered an SRA download issue. I see that you discussed using fasterq-dump, parallelization, and aws s3 cp, but eventually went with fastq-dump. Would you recommend fasterq-dump over aws s3 cp? I need to fetch about 10k SRA accessions and want to parallelize to save time. Thank you.