Closed fevac closed 1 year ago
One of the possible solution, to make sure that the bam file is valid and also improve the current TAT (BALSAMIC v11.0.1), is to add the workflow mentioned below: https://gatk.broadinstitute.org/hc/en-us/articles/360035891231-Errors-in-SAM-or-BAM-files-can-be-diagnosed-with-ValidateSamFile
Currently running ValidateSamFile on bamfiles produced at different stages of the pipeline:
This to understand in more detail what errors and warnings exist in the bamfiles, where they arise, and where and if they are fixed.
Ran ValidateSamFile on 53 WGS bamfiles from version v11.0.2 of BALSAMIC from the realign-step. The only errors I discovered was consistently "MATE_NOT_FOUND", at varying but always high numbers, often above 100k.
These errors existed in all stages of the bamfile processing and remained constant after dedup and realignment, but was somewhat decreased in the merged bam step. Below is an example following the same sample.
ACC11303A1 | dedup.realign.bam | ACC11303A1 | dedup.bam | validwasp_tumor ACC11303A1 | merged |
---|---|---|---|---|---|
INVALID_PLATFORM_VALUE | 1 | INVALID_PLATFORM_VALUE | 1 | INVALID_PLATFORM_VALUE | 1 |
MATE_NOT_FOUND | 315442 | MATE_NOT_FOUND | 315442 | MATE_NOT_FOUND | 259469 |
However running ValidateSamFile on the bamfiles before Dedup (ran on 3 samples so far) does not show this MATE_NOT_FOUND error, and the suspicion right now is that Sentieon Dedup --rmdup may remove single reads instead of the entire pair, which seems like an unexpected behaviour.
Summary
MATE_NOT_FOUND errors from Picard ValidateSamFile seems to be due to what may be referred to as a bug in these softwares (both Picard MarkDuplicates and Sentieon Dedup) where Unmapped mates and Secondary aligned reads of a read marked or removed as a duplicate are kept in the bamfile.
The issue with the unmapped mate being kept in the bamfile is discussed in an issue in the Picard repo here: https://github.com/broadinstitute/picard/issues/1138
And in an older issue here: https://github.com/broadinstitute/picard/issues/451
This is further confirmed here in addition to how Non-primary aligned reads are kept as well.
MATE_NOT_FOUND errors from Picard MarkDuplicates if REMOVE_DUPLICATES=TRUE
As we are using different tools for removing duplicates depending on if the analysed case is WGS (in which case Sentieon Dedup is used), or TGA / WES (in which case Picard MarkDuplicates is used), I wanted to see if the same error appeared in Picard MarkDuplicates. In all non cases that I have observed (as of the version v11.0.2) Picard MarkDuplicates is run with the option "REMOVE_DUPLICATES=FALSE
" at which case duplicates are not removed but only marked as duplicates.
Running ValidateSamFile on the concatenated_ACC11409A4_XXXXXX_R.sorted.mrkdup.bam found no errors, except for the 1 "INVALID_PLATFORM_VALUE". Which would be as expected since this bamfile would only have duplicates marked, not removed.
When running the same command with "REMOVE_DUPLICATES=TRUE
" the MATE_NOT_FOUND error appears again: 4068 MATE_NOT_FOUND errors. A lower amount of errors, which should be reflective of the smaller amount of reads in total in this application (PANKTTR020).
Missing mate error reads appears to be all either unmapped or secondary alignments
To investigate in more depth the property of the reads that have the MATE_NOT_FOUND error I ran FixMateInformation with the VERBOSE mode, which outputs 99 of the read-names with the errors. I manually extracted about 10 of these reads from before and after Sentieon Dedup, and Picard MarkDuplicates respectively.
In each case the mate which had been removed as a duplicate was properly mapped, whereas the other mate was either unmapped or the secondary alignment of the removed duplicate read.
To get some more info I ran samtools flagstats I extracted all read-entries from the 99 read-names from one WGSPCFC030 (manyduck normal sample), and similarly for a bamfile where duplicates had been removed with Picard MarkDuplicates (PANKTTR020 sample ACC11409A4).
ACC11409A4: picard remove duplicates | manyduck normal: sentieon dedup | |
---|---|---|
total (QC-passed reads + QC-failed reads) | 119 | 146 |
primary | 99 | 99 |
secondary | 20 | 47 |
supplementary | 0 | 0 |
duplicates | 0 | 0 |
primary duplicates | 0 | 0 |
mapped | 20 | 47 |
mapped % | 16.81% | 32.19% |
primary mapped | 0 | 0 |
primary mapped % | 0.00% | 0.00% |
paired in sequencing | 99 | 99 |
read1 | 36 | 38 |
read2 | 63 | 61 |
properly paired | 0 | 0 |
properly paired % | 0.00% | 0.00% |
with itself and mate mapped | 0 | 0 |
singletons | 0 | 0 |
singletons % | 0.00% | 0.00% |
with mate mapped to a different chr | 0 | 0 |
with mate mapped to a different chr (mapQ>=5 | 0 | 0 |
These stats seem to confirm the story, that Picard MarkDuplicates and Sentieon Dedup keeps when removing duplicates the non-primary mapped reads and the unmapped mates.
Reduced amount of MATE_NOT_FOUND errors in mergeBam rules is due to removal of unmapped reads
To investigate the role of the mergeBam rules in reducing the MATE_NOT_FOUND error, I ran AddOrReplaceReadGroups followed by FixMateInformation mirroring the parameters used in version v11.0.2 of BALSAMIC on one of the realigned bamfiles (sample: ACC11273A2). Neither of these options affected the numbers of MATE_NOT_FOUND errors. See table below: | ACC11273A2 | dedup.realign.bam | ACC11273A2 | dedup.realign.bam (RG fix) | ACC11273A2 | dedup.realign.bam (RG fix + FixMate) |
---|---|---|---|---|---|---|
INVALID_PLATFORM_VALUE | 1 | INVALID_PLATFORM_VALUE | 1 | INVALID_PLATFORM_VALUE | 1 | |
MATE_NOT_FOUND | 116793 | MATE_NOT_FOUND | 116793 | MATE_NOT_FOUND | 116793 |
It appears instead that it is due to the removal of unmapped reads that the numbers of MATE_NOT_FOUND errors are decreased. However I still need to confirm this by running ValidateSamFile on a bamfile where the unmapped reads have been removed.
Additionally I would expect that all MATE_NOT_FOUND errors would disappear if all reads with a mapping quality of 0 (including then secondary aligned reads) was removed.
MATE_NOT_FOUND errors exists in MIP case too
The same tool, Picard MarkDuplicates is used in the rare-disease pipeline MIP. To see if this error appears there as well I ran ValidateSamFile on a WGSPCFC030 sample (fairmink), and a EXOKTTR060 sample (cleargibbon), specifically on the final bam used in the analysis, ending with the suffix "sorted_md.cram".
MATE_NOT_FOUND error does appear in these cases too: fairmink: 421328 MATE_NOT_FOUND errors cleargibbon: 15649 MATE_NOT_FOUND errors
Which I don't understand as in MIP I believe the duplicates are only marked. Not removed. Though perhaps these mates are lost for some other reason.
Description
Currently mergeBam rule takes a long time to complete (about 24h in
grandmarmot
) with a significant increase over the previous version (10.0.5). It would be nice to optimise this rule as much as possible.Related to: https://github.com/Clinical-Genomics/BALSAMIC/issues/1015 and linked PRs.
Potential solutions
picard FixMateInformation
is necessary. If not restore to previous state and excludeFixMateInformation
andsamtools
commands. If necessary, investigate how to speed it upAddOrReplaceReadGroups
could be completely omitted if we provide the appropriate RG while doing the alignment inbwa mem
. Note that this might have implications for subsequent steps that use the read group information.BALSAMIC version
v11.0.1
Current affected rule if snakemake workflow related mergeBam_normal() mergeBam_tumor()