Clinical-Genomics / BALSAMIC

Bioinformatic Analysis pipeLine for SomAtic Mutations In Cancer
https://balsamic.readthedocs.io/
MIT License
44 stars 16 forks source link

Optimise mergeBam rule #1053

Closed fevac closed 1 year ago

fevac commented 1 year ago

Description

Currently mergeBam rule takes a long time to complete (about 24h in grandmarmot) with a significant increase over the previous version (10.0.5). It would be nice to optimise this rule as much as possible.

Related to: https://github.com/Clinical-Genomics/BALSAMIC/issues/1015 and linked PRs.

Potential solutions

BALSAMIC version v11.0.1

Current affected rule if snakemake workflow related mergeBam_normal() mergeBam_tumor()

khurrammaqbool commented 1 year ago

One of the possible solution, to make sure that the bam file is valid and also improve the current TAT (BALSAMIC v11.0.1), is to add the workflow mentioned below: https://gatk.broadinstitute.org/hc/en-us/articles/360035891231-Errors-in-SAM-or-BAM-files-can-be-diagnosed-with-ValidateSamFile

image

mathiasbio commented 1 year ago

Currently running ValidateSamFile on bamfiles produced at different stages of the pipeline:

This to understand in more detail what errors and warnings exist in the bamfiles, where they arise, and where and if they are fixed.

mathiasbio commented 1 year ago

Ran ValidateSamFile on 53 WGS bamfiles from version v11.0.2 of BALSAMIC from the realign-step. The only errors I discovered was consistently "MATE_NOT_FOUND", at varying but always high numbers, often above 100k.

These errors existed in all stages of the bamfile processing and remained constant after dedup and realignment, but was somewhat decreased in the merged bam step. Below is an example following the same sample.

ACC11303A1 dedup.realign.bam ACC11303A1 dedup.bam validwasp_tumor ACC11303A1 merged
INVALID_PLATFORM_VALUE 1 INVALID_PLATFORM_VALUE 1 INVALID_PLATFORM_VALUE 1
MATE_NOT_FOUND 315442 MATE_NOT_FOUND 315442 MATE_NOT_FOUND 259469

However running ValidateSamFile on the bamfiles before Dedup (ran on 3 samples so far) does not show this MATE_NOT_FOUND error, and the suspicion right now is that Sentieon Dedup --rmdup may remove single reads instead of the entire pair, which seems like an unexpected behaviour.

mathiasbio commented 1 year ago

Summary

MATE_NOT_FOUND errors from Picard ValidateSamFile seems to be due to what may be referred to as a bug in these softwares (both Picard MarkDuplicates and Sentieon Dedup) where Unmapped mates and Secondary aligned reads of a read marked or removed as a duplicate are kept in the bamfile.

The issue with the unmapped mate being kept in the bamfile is discussed in an issue in the Picard repo here: https://github.com/broadinstitute/picard/issues/1138 And in an older issue here: https://github.com/broadinstitute/picard/issues/451

This is further confirmed here in addition to how Non-primary aligned reads are kept as well.

MATE_NOT_FOUND errors from Picard MarkDuplicates if REMOVE_DUPLICATES=TRUE

As we are using different tools for removing duplicates depending on if the analysed case is WGS (in which case Sentieon Dedup is used), or TGA / WES (in which case Picard MarkDuplicates is used), I wanted to see if the same error appeared in Picard MarkDuplicates. In all non cases that I have observed (as of the version v11.0.2) Picard MarkDuplicates is run with the option "REMOVE_DUPLICATES=FALSE" at which case duplicates are not removed but only marked as duplicates.

Running ValidateSamFile on the concatenated_ACC11409A4_XXXXXX_R.sorted.mrkdup.bam found no errors, except for the 1 "INVALID_PLATFORM_VALUE". Which would be as expected since this bamfile would only have duplicates marked, not removed.

When running the same command with "REMOVE_DUPLICATES=TRUE" the MATE_NOT_FOUND error appears again: 4068 MATE_NOT_FOUND errors. A lower amount of errors, which should be reflective of the smaller amount of reads in total in this application (PANKTTR020).

Missing mate error reads appears to be all either unmapped or secondary alignments

To investigate in more depth the property of the reads that have the MATE_NOT_FOUND error I ran FixMateInformation with the VERBOSE mode, which outputs 99 of the read-names with the errors. I manually extracted about 10 of these reads from before and after Sentieon Dedup, and Picard MarkDuplicates respectively.

In each case the mate which had been removed as a duplicate was properly mapped, whereas the other mate was either unmapped or the secondary alignment of the removed duplicate read.

To get some more info I ran samtools flagstats I extracted all read-entries from the 99 read-names from one WGSPCFC030 (manyduck normal sample), and similarly for a bamfile where duplicates had been removed with Picard MarkDuplicates (PANKTTR020 sample ACC11409A4).

ACC11409A4: picard remove duplicates manyduck normal: sentieon dedup
total (QC-passed reads + QC-failed reads) 119 146
primary 99 99
secondary 20 47
supplementary 0 0
duplicates 0 0
primary duplicates 0 0
mapped 20 47
mapped % 16.81% 32.19%
primary mapped 0 0
primary mapped % 0.00% 0.00%
paired in sequencing 99 99
read1 36 38
read2 63 61
properly paired 0 0
properly paired % 0.00% 0.00%
with itself and mate mapped 0 0
singletons 0 0
singletons % 0.00% 0.00%
with mate mapped to a different chr 0 0
with mate mapped to a different chr (mapQ>=5 0 0

These stats seem to confirm the story, that Picard MarkDuplicates and Sentieon Dedup keeps when removing duplicates the non-primary mapped reads and the unmapped mates.

Reduced amount of MATE_NOT_FOUND errors in mergeBam rules is due to removal of unmapped reads

To investigate the role of the mergeBam rules in reducing the MATE_NOT_FOUND error, I ran AddOrReplaceReadGroups followed by FixMateInformation mirroring the parameters used in version v11.0.2 of BALSAMIC on one of the realigned bamfiles (sample: ACC11273A2). Neither of these options affected the numbers of MATE_NOT_FOUND errors. See table below: ACC11273A2 dedup.realign.bam ACC11273A2 dedup.realign.bam (RG fix) ACC11273A2 dedup.realign.bam (RG fix + FixMate)
INVALID_PLATFORM_VALUE 1 INVALID_PLATFORM_VALUE 1 INVALID_PLATFORM_VALUE 1
MATE_NOT_FOUND 116793 MATE_NOT_FOUND 116793 MATE_NOT_FOUND 116793

It appears instead that it is due to the removal of unmapped reads that the numbers of MATE_NOT_FOUND errors are decreased. However I still need to confirm this by running ValidateSamFile on a bamfile where the unmapped reads have been removed.

Additionally I would expect that all MATE_NOT_FOUND errors would disappear if all reads with a mapping quality of 0 (including then secondary aligned reads) was removed.

mathiasbio commented 1 year ago

MATE_NOT_FOUND errors exists in MIP case too

The same tool, Picard MarkDuplicates is used in the rare-disease pipeline MIP. To see if this error appears there as well I ran ValidateSamFile on a WGSPCFC030 sample (fairmink), and a EXOKTTR060 sample (cleargibbon), specifically on the final bam used in the analysis, ending with the suffix "sorted_md.cram".

MATE_NOT_FOUND error does appear in these cases too: fairmink: 421328 MATE_NOT_FOUND errors cleargibbon: 15649 MATE_NOT_FOUND errors

Which I don't understand as in MIP I believe the duplicates are only marked. Not removed. Though perhaps these mates are lost for some other reason.