bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License
192 stars 53 forks source link

Error when running FFPE training data from SEQ2C #130

Closed GACGAMA closed 8 months ago

GACGAMA commented 8 months ago

Hello! I'm trying to train the AI model with different inputs such as WGS, TITRATION, SYNTHETIC and FFPE. Im running SomaticSeq on a slurm cluster with singularity wherever possible.

I have had many errors with FFPE samples because the BAM files were not named correctly according to normal-tumor pairs. To correct this, I used:

picard AddOrReplaceReadGroups I=/data/nsobrei2/ggama1/training/bams/FFPE/{1}.bwa.dedup.bam O=/data/nsobrei2/ggama1/training/bams/FFPE_reheader/{1}.bwa.dedup.bam RGID={1} RGLB=na RGPL=ILLUMINA RGPU=na RGSM={1} RGDS={1}.bwa.dedup.bam

Which is basically adding the file name as sample name into read groups. All mutation callers did work without any errors.

But when creating the consensus

somaticseq_parallel.py --output-directory \$BASE_PATH/somaticseq/vcf_per_sample/$EXPERIMENT_PATH/\$filenames/consensus_vcf --genome-reference \$HUMAN_REFERENCE_PATH --dbsnp \$DBSNP_PATH --threads \$THREADS --truth-snv \$TRUTH_SNV --truth-indel \$TRUTH_INDEL paired --tumor-bam-file \$filepathsstarR1 --normal-bam-file \$filepathsstarR2 --mutect2-vcf \$BASE_PATH/somaticseq/vcf_per_sample/$EXPERIMENT_PATH/\$filenames/mutect2/MuTect2.vcf.gz --vardict-vcf \$BASE_PATH/somaticseq/vcf_per_sample/$EXPERIMENT_PATH/\$filenames/vardict/VarDict.vcf.gz --somaticsniper-vcf \$BASE_PATH/somaticseq/vcf_per_sample/$EXPERIMENT_PATH/\$filenames/somaticsniper/SomaticSniper.vcf --muse-vcf \$BASE_PATH/somaticseq/vcf_per_sample/$EXPERIMENT_PATH/\$filenames/muse/MuSE.vcf.gz --strelka-snv \$BASE_PATH/somaticseq/vcf_per_sample/$EXPERIMENT_PATH/\$filenames/strelka2/Strelka.snv.vcf.gz --strelka-indel \$BASE_PATH/somaticseq/vcf_per_sample/$EXPERIMENT_PATH/\$filenames/strelka2/Strelka.indel.vcf.gz --varscan-snv \$BASE_PATH/somaticseq/vcf_per_sample/$EXPERIMENT_PATH/\$filenames/varscan2/VarScan2.snv.vcf.gz --varscan-indel \$BASE_PATH/somaticseq/vcf_per_sample/$EXPERIMENT_PATH/\$filenames/varscan2/VarScan2.indel.vcf.gz --lofreq-snv \$BASE_PATH/somaticseq/vcf_per_sample/$EXPERIMENT_PATH/\$filenames/lofreq/LoFreq.snv.vcf.gz --lofreq-indel \$BASE_PATH/somaticseq/vcf_per_sample/$EXPERIMENT_PATH/\$filenames/lofreq/LoFreq.indel.vcf.gz

This is running, but then gives me the following error for all samples from FFPE (SYNTHETIC did work so the program itself is running ok):


INFO 2024-01-23 18:03:27,384 runPaired            Removed /scratch4/nsobrei2/ggama1/training/somaticseq/vcf_per_sample/FFPE/FFG_GZ_T_24h-F/consensus_vcf/19/intersect.strelka.snv.vcf
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
           ^^^^^^^^^^^^^^^^
  File "/home/ggama1/programs/somaticseq/somaticseq/somaticseq_parallel.py", line 84, in runPaired_by_region
    run_somaticseq.runPaired(
  File "/home/ggama1/programs/somaticseq/somaticseq/run_somaticseq.py", line 156, in runPaired
    somatic_vcf2tsv.vcf2tsv(is_vcf=outSnv, nbam_fn=nbam, tbam_fn=tbam, truth=truth_snv, cosmic=cosmic, dbsnp=dbsnp,
  File "/home/ggama1/programs/somaticseq/somaticseq/somatic_vcf2tsv.py", line 429, in vcf2tsv
    if dbsnp:    got_dbsnp,    dbsnp_variants,    dbsnp_line    = genome.find_vcf_at_coordinate(my_coordinate, dbsnp_line,   dbsnp,   chrom_seq)
                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/programs/somaticseq/somaticseq/genomicFileHandler/genomic_file_handlers.py", line 577, in find_vcf_at_coordinate
    latest_vcf_run  = catchup_multilines(my_coordinate, latest_vcf_line, vcf_file_handle, chrom_seq)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/programs/somaticseq/somaticseq/genomicFileHandler/genomic_file_handlers.py", line 521, in catchup_multilines
    if whoisbehind(coordinate_j, next_coord.group(), chrom_sequence) == 1:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/programs/somaticseq/somaticseq/genomicFileHandler/genomic_file_handlers.py", line 322, in whoisbehind
    chrom1_position = chrom_sequence[chrom1]
                      ~~~~~~~~~~~~~~^^^^^^^^
KeyError: 'chr1_KI270766v1_alt'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ggama1/.conda/envs/somaticseq/bin/somaticseq_parallel.py", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/home/ggama1/programs/somaticseq/somaticseq/somaticseq_parallel.py", line 308, in <module>
    subdirs = pool.map(runPaired_by_region_i, bed_splitted)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ggama1/.conda/envs/somaticseq/lib/python3.11/multiprocessing/pool.py", line 774, in get
    raise self._value
KeyError: 'chr1_KI270766v1_alt'
litaifang commented 8 months ago

Does the bam file has contigs (i.e., chr1_KI270766v1_alt) that does not exist in the genome reference file? One possible get around is to create a bed file with only the real chromosomes and pass it before the paired option as --inclusion-region.

GACGAMA commented 8 months ago

I will verify, but I'm using the reference file and bam files from SEQ2C recomendation, so it should be fine. I will be testint with the High-Confidence_Regions_v1.2.bed provided tough

litaifang commented 8 months ago

This is genome reference SEQC2 used: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/technical/reference_genome/GRCh38/ Which bam files did you use?

GACGAMA commented 8 months ago

Im using the following tumor BAM files, with their respective paired normal FFG_GZ_T_24h-B FFG_GZ_T_2h-A
FFG_IL_T_1h
FFG_IL_T_2h FFG_GZ_T_24h-F FFG_GZ_T_6h-A FFG_IL_T_24h

GACGAMA commented 8 months ago

Rerunning files with -include region High-Confidence_Regions_v1.2.bed did work. Still, FFPE files had to be renamed with piccard to work.