GroupedAnalysis mode failure when more than 1 normal sample provided

auberginekenobi commented 2 months ago

Repro:
Running AS-p GroupedAnalysis mode via registry.hub.docker.com/jluebeck/prepareaa:v1.3.3: python /home/programs/AmpliconSuite-pipeline-master/GroupedAnalysisAmpSuite.py --input input_file.txt --output_directory redacted --nthreads 4 --ref hg38 --cngain 4.5 --rscript_path /usr/bin/Rscript --cnvkit_dir /usr/local/bin --python3_path /usr/bin/ --AA_src /home/programs/AmpliconArchitect-master/src > job.stdout.log

Contents of input_file.txt:

BS_H \t BS_H.bam \t tumor
BS_3 \t BS_3.bam \t tumor
BS_Q \t BS_Q.bam \t normal
BS_5 \t BS_5.bam \t normal

Expected behaviour: program terminates successfully, produces AA outputs for each tumor sample.

Actual behaviour: program encounters fatal error while creating cmd strings for AA runs. stdout:

Found 2 tumor samples and 2 normals

Setting base argstring for Stage 1 as:
 --nthreads 4 --ref hg38 --cngain 4.5 --cnsize_min 50000 --downsample 10 --rscript_path /usr/bin/Rscript --python3_path /usr/bin//python3 --aa_python_interpreter python --AA_src /home/programs/AmpliconArchitect-master/src --AA_runmode FULL --AA_extendmode EXPLORE --cnvkit_segmentation cbs

More than one normal sample specified. Only the first will be used: BS_Q
Generating individual seeds
BS_H
python /home/programs/AmpliconSuite-pipeline-master/AmpliconSuite-pipeline.py --nthreads 4 --ref hg38 --cngain 4.5 --cnsize_min 50000 --downsample 10 --rscript_path /usr/bin/Rscript --python3_path /usr/bin//python3 --aa_python_interpreter python --AA_src /home/programs/AmpliconArchitect-master/src --AA_runmode FULL --AA_extendmode EXPLORE --cnvkit_segmentation cbs -s BS_H --bam /sbgenomics/workspaces/f6e90687-b0cc-4af1-a024-7bf34a02fcf3/tasks/0964c5d1-232e-4b9f-96b4-a7d9826a6856/cramrecord2bamrecord_1_s_record_constructor/b6e87a21-9afb-4f56-9949-2d1ddf4bb044.bam -o redacted/BS_H --normal_bam /sbgenomics/workspaces/f6e90687-b0cc-4af1-a024-7bf34a02fcf3/tasks/0964c5d1-232e-4b9f-96b4-a7d9826a6856/cramrecord2bamrecord_3_s_record_constructor/38a035c3-6354-4f7e-99b0-72e8d7c94bba.bam --cnvkit_dir /usr/local/bin --no_QC

BS_3
python /home/programs/AmpliconSuite-pipeline-master/AmpliconSuite-pipeline.py --nthreads 4 --ref hg38 --cngain 4.5 --cnsize_min 50000 --downsample 10 --rscript_path /usr/bin/Rscript --python3_path /usr/bin//python3 --aa_python_interpreter python --AA_src /home/programs/AmpliconArchitect-master/src --AA_runmode FULL --AA_extendmode EXPLORE --cnvkit_segmentation cbs -s BS_3 --bam /sbgenomics/workspaces/f6e90687-b0cc-4af1-a024-7bf34a02fcf3/tasks/0964c5d1-232e-4b9f-96b4-a7d9826a6856/cramrecord2bamrecord_2_s_record_constructor/f1b1c4ed-efe5-4be1-a966-39a4d8d1cd93.bam -o redacted/BS_3 --normal_bam /sbgenomics/workspaces/f6e90687-b0cc-4af1-a024-7bf34a02fcf3/tasks/0964c5d1-232e-4b9f-96b4-a7d9826a6856/cramrecord2bamrecord_3_s_record_constructor/38a035c3-6354-4f7e-99b0-72e8d7c94bba.bam --cnvkit_dir /usr/local/bin --no_QC

Merging seeds
sort -k1,1 -k2,2n redacted/BS_3//BS_3_AA_CNV_SEEDS.bed redacted/BS_H//BS_H_AA_CNV_SEEDS.bed | bedtools merge -i - > redacted/BS_H_BS_3_merged_AA_CNV_SEEDS.bed

stderr:

2024-07-25T00:40:37.567866410Z Traceback (most recent call last):
2024-07-25T00:40:37.571462062Z   File "/home/programs/AmpliconSuite-pipeline-master/GroupedAnalysisAmpSuite.py", line 362, in <module>
2024-07-25T00:40:37.572004761Z     cmd_dict = create_AA_AC_cmds(all_lines, base_argstring, grouped_seeds, args.output_directory)
2024-07-25T00:40:37.572007880Z   File "/home/programs/AmpliconSuite-pipeline-master/GroupedAnalysisAmpSuite.py", line 115, in create_AA_AC_cmds
2024-07-25T00:40:37.572011028Z     curr_seeds = grouped_seeds[tf[0]]
2024-07-25T00:40:37.572013660Z KeyError: 'BS_5'

Diagnosis:
On line 356 of GroupedAnalysisAmpSuite.py, the seed for the first normal sample is set correctly but no other normal sample records are altered. Then, if the --skip_AA_on_normal_bam flag is not set, AA runs are initiated for all tumor and normal samples. However, the correct argstrings have only been generated for all tumor and the first normal samples, leading to the error.

jluebeck commented 2 months ago

Thanks Owen, this is a good find, I will take a look ASAP.

auberginekenobi commented 2 months ago

Sure thing, a related note that CNVkit recommends to pool all normal samples to generate the reference but that's maybe a feature request or an improvement rather than strictly related to the bugfix

jluebeck commented 2 months ago

Delayed response on my part as I have been out-of-office.

Temporary workaround is to keep one of the normals as your "Normal" sample (recommend normal blood if possible) and set the other additional normals to "Tumor" status. This will use the one normal sample as the matched normal for all samples, including the additional normal samples. Not perfect but okay if you expect your multiple normal samples to be fairly similar.

jluebeck commented 2 months ago

Hi Owen, if you pull the latest version of AmpliconSuite-pipeline from GitHub you can now access the latest version of AmpliconSuite-pipeline that fixes the issue with multiple normal files in GroupedAnalysis (v1.3.4).

jluebeck commented 2 weeks ago

Version 1.3.4 released and bug fixed in this version

AmpliconSuite / AmpliconSuite-pipeline

GroupedAnalysis mode failure when more than 1 normal sample provided #62