ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

:construction: WIP: Batch run #125

Open itzamna314 opened 4 years ago

itzamna314 commented 4 years ago

This is the beginning of a container that can run the serratus pipeline end-to-end. I'm not sure what settings bowtie2 needs, though, so I haven't been able to get past those runs.

I'm also not sure if it's appropriate to write to all 3 pipes and then try to run both flavors of bowtie (paired and unpaired?), or if we need to figure out which scenario we're in and only run the one bowtie process.
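For concreteness, the paired / named-pipe scenario I have in mind would be something like this (accession and FIFO names are illustrative, and I'm assuming fastq-dump --split-files will write into pre-created FIFOs; the bowtie2 options are left out since I don't know them yet):

mkfifo SRR12345_1.fastq SRR12345_2.fastq # FIFOs where --split-files would write its paired output

fastq-dump --split-files SRR12345 & # stream paired reads into the FIFOs in the background

bowtie2 -x INDEXNAME -1 SRR12345_1.fastq -2 SRR12345_2.fastq > out.sam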

I think the input to the container is good, though, so hopefully we're on the right track.

rcedgar commented 4 years ago

I believe we can and should simplify by always using bowtie2 in unpaired mode and not using the --split-files option of fastq-dump. That way, the same command line should work for any SRA dataset AFAIK; Artem can correct me if I'm wrong here. It also means we only need one pipe, with no need for named pipes.

rcedgar commented 4 years ago

I believe the only option we need for bowtie2 is --very-sensitive-local with /dev/stdin for unpaired fastq input.

rcedgar commented 4 years ago

I would suggest the following simplification & optimization. Combine the bowtie2, prefetch, fastq-dump and samtools binaries, summarizer.py, and the bowtie2 index files into one tarball on S3. When the container starts, install only the AWS CLI and base python3, then copy the tarball and decompress it. At that point the container is ready to do

prefetch SRA12345

fastq-dump SRA12345 | bowtie2 | summarizer.py | samtools > output.bam # single pipe

aws s3 cp output.bam s3://serratus-public/out/...
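Spelled out a bit more, the whole thing might look like the sketch below (the tarball path, the --stdout flag on fastq-dump, and summarizer.py passing SAM through on stdout are my assumptions):

pip3 install awscli # container starts with base python3 only

aws s3 cp s3://serratus-public/tools/serratus-tools.tar.gz . && tar xzf serratus-tools.tar.gz # hypothetical tarball path

prefetch SRA12345

fastq-dump --stdout SRA12345 | bowtie2 -x INDEXNAME --very-sensitive-local -U /dev/stdin | ./summarizer.py | samtools view -b - > output.bam # single pipe, SAM in / BAM out

aws s3 cp output.bam s3://serratus-public/out/...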

itzamna314 commented 4 years ago

> I would suggest the following simplification & optimization. Combine the bowtie2, prefetch, fastq-dump and samtools binaries, summarizer.py, and the bowtie2 index files into one tarball on S3. When the container starts, install only the AWS CLI and base python3, then copy the tarball and decompress it. At that point the container is ready to do

That actually adds a lot of complexity. Using Docker, we simply build an image with all of those executables installed. Then when we create a container, they're ready to go instantly.
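For example, a minimal Dockerfile for this could look roughly like the sketch below (base image and package names are my guesses; the real image would also need summarizer.py and the index files copied in or fetched at startup):

FROM debian:buster-slim

# Install the pipeline tools from the distro package repos (package names are assumptions)
RUN apt-get update && apt-get install -y --no-install-recommends \
    bowtie2 samtools sra-toolkit awscli python3 \
    && rm -rf /var/lib/apt/lists/*

# Hypothetical location for the summarizer script
COPY summarizer.py /usr/local/bin/summarizer.py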

I'll see if I can guess the right parameters for bowtie2. I've never used it before, though, and I have no background in biology. I know how to get the executables where they need to be, but not so much what they do or how to run them.

rcedgar commented 4 years ago

I have no background in Docker, so my bad on that -- I'm trying to learn but am struggling so far.

I think this command-line for bowtie2 should work with unpaired FASTQ from a pipe, sending SAM output to a pipe:

bowtie2 -x INDEXNAME --very-sensitive-local -U /dev/stdin

Contact me by email (robert@drive5.com) or on the serratus-bioinformatics Slack channel if you need help with the informatics pipeline.

itzamna314 commented 4 years ago

All good, happy to help clear 🐳 stuff up 👍

Where can I find the value for INDEXNAME in this scenario? It comes from a JOB_JSON file in the full serratus pipeline.

I think that's the piece I'm missing to get this running. I'll ping you over on the serratus slack 👍. Thanks!

ababaian commented 4 years ago

You'll need a genome/sequence file and an index of that genome to run bowtie. In essence it takes short little bits of DNA and tries to place them in a big piece of DNA. Kind of like a fuzzy regex.

Genome + Bowtie2 index files: aws s3 sync s3://serratus-public/seq/cov3a/ ./

As long as those files are in the same directory as bowtie2, you can run -x cov3a (or whatever the prefix of the .bt2 files is).
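So a minimal end-to-end test might look like this (the accession is illustrative, and using --stdout to stream fastq-dump is an assumption):

aws s3 sync s3://serratus-public/seq/cov3a/ ./ # fetch genome + index files

fastq-dump --stdout SRR12345 | bowtie2 -x cov3a --very-sensitive-local -U /dev/stdin > out.sam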