itzamna314 opened 4 years ago
I believe we can and should simplify by always using bowtie2's unpaired mode and not using the --split-files option of fastq-dump. That way, the same command line should work for any SRA dataset, AFAIK; Artem can correct me if I'm wrong here. We then only need one pipe, with no need for named pipes.
I believe the only option we need for bowtie2 is --very-sensitive-local, with /dev/stdin as the unpaired FASTQ input.
I would suggest the following simplification & optimization: combine the bowtie2, prefetch, fastq-dump and samtools binaries, summarizer.py, and the bowtie2 index files into one tarball on S3. When the container starts, install only the AWS CLI and base python3, then copy the tarball and decompress it. At that point the container is ready to do:
prefetch SRA12345
fastq-dump SRA12345 | bowtie2 | summarizer.py | samtools > output.bam # single pipe
aws s3 cp output.bam s3://serratus-public/out/...
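Putting those three steps together, a single-pipe job could be wrapped in a function roughly like this. This is only a sketch: run_sra_job, sra_id, index, and out_s3 are placeholder names, I've left summarizer.py out of the pipe since its interface isn't shown here, and the bowtie2 arguments follow the unpaired-mode suggestion above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sketch only: function and variable names are placeholders, not serratus code.
run_sra_job() {
    local sra_id="$1" index="$2" out_s3="$3"

    prefetch "$sra_id"

    # Unpaired mode: one pipe end-to-end, no --split-files, no named pipes.
    # (summarizer.py would sit between bowtie2 and samtools in the real pipe.)
    fastq-dump --stdout "$sra_id" \
        | bowtie2 -x "$index" --very-sensitive-local -U /dev/stdin \
        | samtools view -bS - > output.bam

    aws s3 cp output.bam "$out_s3"
}
```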
That actually adds a lot of complexity. Using Docker, we simply build an image with all of those executables installed. Then when we create a container, they're ready to go instantly.
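A hedged sketch of what that Docker route might look like: everything is baked into the image at build time, so the container is ready immediately. The base image, install commands, and file paths below are assumptions for illustration, not the actual serratus Dockerfile.

```dockerfile
# Hypothetical sketch -- not the actual serratus image.
FROM amazonlinux:2

# Package names are illustrative; bowtie2/samtools/sra-tools may not be in
# the default repos and often have to be installed from prebuilt binaries.
RUN yum install -y python3 awscli bowtie2 samtools sra-tools

# Bake the summarizer and the bowtie2 index into the image.
COPY summarizer.py /usr/local/bin/summarizer.py
COPY index/ /index/

WORKDIR /index
```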
I'll see if I can guess the parameters right for bowtie2. I've never used it before though; I have no background in biology. I know how to get the executables where they need to be, but not so much what they do or how to run them.
I have no background in Docker, so my bad on that -- I'm trying to learn but am struggling so far.
I think this command-line for bowtie2 should work with unpaired FASTQ from a pipe, sending SAM output to a pipe:
bowtie2 -x INDEXNAME --very-sensitive-local -U /dev/stdin
Contact me by email robert@drive5.com or the serratus-bioinformatics slack channel if you need help with the informatics pipe.
All good, happy to help clear 🐳 stuff up 👍
Where can I find the value for INDEXNAME in this scenario? It comes from a JOB_JSON file in the full serratus pipeline.
I think that's the piece I'm missing to get this running. I'll ping you over on the serratus slack 👍. Thanks!
You'll need a genome/sequence file and an index of that genome to run bowtie. In essence it takes short little bits of DNA and tries to place them in a big piece of DNA. Kind of like a fuzzy regex.
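The "fuzzy regex" analogy can be made concrete with an exact-match toy (my own illustration, not bowtie2's algorithm): find the offset where a short read lands in a reference string. bowtie2 does this at scale with an FM-index and also tolerates mismatches and gaps, which plain grep does not.

```shell
# Toy only: exact placement of a short "read" in a longer "reference".
ref="ACGTACGTTTGACCA"
read_seq="TTGACC"
echo "$ref" | grep -ob "$read_seq"   # → 8:TTGACC (0-based byte offset)
```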
Genome + Bowtie2 index files: aws s3 sync s3://serratus-public/seq/cov3a/ ./
As long as those files are in the same directory as bowtie2, you can run -x cov3a (or whatever the prefix to the .bt2 files is).
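One way to sanity-check the prefix to pass to -x is to derive it from the .bt2 filenames. A small demo with dummy files (the idx/ directory and the sed expression are just for illustration):

```shell
# Dummy index files stand in for the real ones fetched by `aws s3 sync`.
mkdir -p idx
touch idx/cov3a.1.bt2 idx/cov3a.2.bt2 idx/cov3a.rev.1.bt2

# Strip the ".rev" and ".N.bt2" suffixes to recover the -x prefix.
for f in idx/*.bt2; do basename "$f"; done \
    | sed 's/\.rev//; s/\.[0-9]*\.bt2$//' | sort -u   # → cov3a
```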
This is the beginnings of a container that can run the serratus pipeline end-to-end. I'm not sure what settings I need for bowtie2 though, so I haven't been able to get past those runs. I'm also not sure whether it's appropriate to write to all 3 pipes and then run both flavors of bowtie (paired and unpaired?), or whether we need to figure out which scenario we're in and only run the one bowtie process.
I think the input to the container is good though, so hopefully we're on the right track