bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License

Dockerized alignment workflow does not work with multiple input files #124

Closed: GACGAMA closed this issue 11 months ago

GACGAMA commented 1 year ago

I recently installed SomaticSeq (v3.7.3) in a conda environment by cloning the repo and doing an editable pip install (pip install -e).

Using

makeAlignmentScripts.py --output-directory /scratch4/bams --in-fastq1s /scratch4/fastq/a.R1.fastq /scratch4/fastq/b.R1.fastq --in-fastq2s /scratch4/fastq/a.R1.fastq /scratch4/fastq/b.R2.fastq --out-fastq1-name a.R1.fq.gz --out-fastq2-name a.R2.fq.gz --genome-reference /scratch4/references/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --out-bam a.bam --bam-header '@RG\tID:read_group_001\tPL:illumina\tLB:library_001\tSM:patient_001' --container-tech singularity --threads 6 --run-trimming --split-input-fastqs --run-alignment --run-mark-duplicates --run-workflow

Produces:

a.R1.fq.gz a.R2.fq.gz a.bam aligned.bwa.bam

Selecting --trim-software trimmomatic and --markdup-software picard, and removing --split-input-fastqs, still does not produce correctly named files!

GACGAMA commented 1 year ago

By combining GNU parallel with cat, I was able to make it work:

CSV file (mycsvfile.csv):

a.R1.fq.gz,a.R2.fq.gz,a
b.R1.fq.gz,b.R2.fq.gz,b

cat mycsvfile.csv | parallel -j 1 --verbose --colsep ',' --link 'makeAlignmentScripts.py --output-directory /scratch4/bams/{3} --in-fastq1s /scratch4/fastq/{1} --in-fastq2s /scratch4/fastq/{2} --out-fastq1-name {3}.R1.merged.fq.gz --out-fastq2-name {3}.R2.merged.fq.gz --genome-reference /scratch4/references/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --out-bam {3}.bam --bam-header "@RG\tID:read_group_001\tPL:illumina\tLB:library_001\tSM:{3}" --container-tech singularity --threads 6 --run-trimming --split-input-fastqs --run-alignment --run-mark-duplicates --run-workflow' & wait
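The same one-invocation-per-sample loop can be sketched in Python. This is a minimal sketch, not part of SomaticSeq: the helper names `build_command` and `commands_from_csv` are hypothetical, and the paths and flags are copied from the commands in this thread. Because each command is built as an argument list, no shell quoting of the --bam-header value is needed.

```python
import csv
import io
import shlex

# Reference path copied from the commands in this thread.
REFERENCE = "/scratch4/references/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna"


def build_command(fq1, fq2, sample):
    """Build one makeAlignmentScripts.py argument list for a single sample."""
    return [
        "makeAlignmentScripts.py",
        "--output-directory", f"/scratch4/bams/{sample}",
        "--in-fastq1s", f"/scratch4/fastq/{fq1}",
        "--in-fastq2s", f"/scratch4/fastq/{fq2}",
        "--out-fastq1-name", f"{sample}.R1.merged.fq.gz",
        "--out-fastq2-name", f"{sample}.R2.merged.fq.gz",
        "--genome-reference", REFERENCE,
        "--out-bam", f"{sample}.bam",
        # Raw f-string: the backslash-t sequences stay literal, as in the shell version.
        "--bam-header", rf"@RG\tID:read_group_001\tPL:illumina\tLB:library_001\tSM:{sample}",
        "--container-tech", "singularity",
        "--threads", "6",
        "--run-trimming", "--split-input-fastqs", "--run-alignment",
        "--run-mark-duplicates", "--run-workflow",
    ]


def commands_from_csv(handle):
    """Read rows of fastq1,fastq2,sample and return one command per row."""
    return [build_command(*(field.strip() for field in row))
            for row in csv.reader(handle) if row]


if __name__ == "__main__":
    # Dry run on the two CSV rows shown above: print the commands, do not execute them.
    rows = io.StringIO("a.R1.fq.gz,a.R2.fq.gz,a\nb.R1.fq.gz,b.R2.fq.gz,b\n")
    for cmd in commands_from_csv(rows):
        print(shlex.join(cmd))
```

To actually run the workflow sequentially, each list could be passed to `subprocess.run(cmd, check=True)` instead of being printed.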

I had to put --bam-header in double quotes because single quotes cannot be nested inside single quotes (parallel executes the command given inside the single quotes). This worked, but now I am running into another problem: every time I use makeAlignmentScripts.py, it downloads images from Docker Hub. Free accounts can only pull 100 times every 6 hours, so I am getting:

FATAL: Unable to handle docker://lethalfang/bwa:0.7.17_samtools uri: failed to get checksum for docker://lethalfang/bwa:0.7.17_samtools: reading manifest 0.7.17_samtools in docker.io/lethalfang/bwa: toomanyrequests:You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
INFO 2023-07-05 13:17:14,685 run_script FINISHED RUNNING /scratch4/bams/a/logs/align.2023.07.05.13.15.08.184.cmd in 30.229 seconds with an exit code of 255.

Is there any way to use local images to solve this issue?

UPDATE: Running makeSomaticScripts.py single with -tech singularity on just one sample with 20 threads hits the same Docker pull limit very quickly.

litaifang commented 11 months ago

Regarding your first post: aligned.bwa.bam is the intermediate BAM file from before duplicate marking took place, so a.bam is the designated final BAM file. Other than that, the file names seem to be the ones you designated.

litaifang commented 11 months ago

I don't quite know how Singularity works. With Docker, an image that has already been downloaded is not downloaded again, but I don't know how to cache images for Singularity.
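One possible workaround, sketched here under assumptions rather than confirmed for SomaticSeq: Singularity can save a Docker image to a local .sif file once with `singularity pull <output.sif> <docker-uri>`, and `singularity exec` accepts that file in place of the docker:// URI. The sketch below assumes the generated .cmd scripts reference images by docker:// URI (as the error message above suggests) and rewrites those references to local files. The helper names and the /scratch4/images directory are hypothetical.

```python
import re

# Matches image references such as docker://lethalfang/bwa:0.7.17_samtools
DOCKER_URI = re.compile(r"docker://[\w./:-]+")


def sif_path_for(uri, image_dir):
    """Map a docker:// URI to a local .sif path (the naming scheme is an assumption)."""
    name = uri[len("docker://"):].replace("/", "_").replace(":", "_")
    return f"{image_dir.rstrip('/')}/{name}.sif"


def pull_command(uri, image_dir):
    """Argument list for a one-time 'singularity pull' into a local .sif file."""
    return ["singularity", "pull", sif_path_for(uri, image_dir), uri]


def localize_script(text, image_dir):
    """Replace every docker:// URI in a script's text with its local .sif path."""
    return DOCKER_URI.sub(lambda m: sif_path_for(m.group(0), image_dir), text)


if __name__ == "__main__":
    uri = "docker://lethalfang/bwa:0.7.17_samtools"
    print(" ".join(pull_command(uri, "/scratch4/images")))
    print(localize_script(f"singularity exec {uri} bwa mem", "/scratch4/images"))
```

Whether a rewritten script still runs unchanged depends on how SomaticSeq generates its Singularity command lines, which was not verified here.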

GACGAMA commented 11 months ago

I'm trying to work out how to use SomaticSeq with Singularity on an HPC cluster. The problem is that, for parallel use on a server where Singularity is installed, only the admins can download the images permanently. This means I will always be limited to 200 pulls from Docker Hub; otherwise it works fine!

As for the first post, I'm still facing the same problem: I can't define multiple outputs with the same script. Even though I can pass multiple inputs (R1 and R2 FASTQs for sample a and sample b), I can't set multiple outputs such as a.bam b.bam; doing so gives an error about multiple outputs where the script expects only one. Is SomaticSeq writing all samples to the same BAM file? Or is it overwriting the output, even though it should produce multiple final BAMs for multi-sample input?

litaifang commented 11 months ago

Yes, when you have multiple input FASTQ files, it is assumed that they all belong to the same sample (e.g., multiple sequencing lanes), and yes, they are combined into a single BAM file.
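In practice this means that lanes of one sample go into a single call, while different samples need separate calls. A minimal sketch of that grouping follows; the helper names and the lane file names are hypothetical, and only the --in-fastq1s, --in-fastq2s and --out-bam flags come from the commands in this thread.

```python
from collections import defaultdict


def group_lanes(rows):
    """Group (sample, fastq1, fastq2) tuples by sample, keeping lane order."""
    grouped = defaultdict(lambda: {"fastq1s": [], "fastq2s": []})
    for sample, fq1, fq2 in rows:
        grouped[sample]["fastq1s"].append(fq1)
        grouped[sample]["fastq2s"].append(fq2)
    return dict(grouped)


def fastq_arguments(sample, lanes):
    """FASTQ and BAM arguments for one call: all lanes in, one BAM out."""
    return [
        "--in-fastq1s", *lanes["fastq1s"],
        "--in-fastq2s", *lanes["fastq2s"],
        "--out-bam", f"{sample}.bam",
    ]


if __name__ == "__main__":
    # Hypothetical input: sample a sequenced on two lanes, sample b on one.
    rows = [
        ("a", "a.L001.R1.fastq", "a.L001.R2.fastq"),
        ("a", "a.L002.R1.fastq", "a.L002.R2.fastq"),
        ("b", "b.L001.R1.fastq", "b.L001.R2.fastq"),
    ]
    for sample, lanes in group_lanes(rows).items():
        print(sample, fastq_arguments(sample, lanes))
```

Each sample then gets its own makeAlignmentScripts.py invocation, with the remaining options added to the arguments shown here.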