Clinical-Genomics / BALSAMIC

Bioinformatic Analysis pipeLine for SomAtic Mutations In Cancer
https://balsamic.readthedocs.io/
MIT License

Add `unify mark and remove duplicates` and `validate bam file` workflows #1095

Closed khurrammaqbool closed 1 year ago

khurrammaqbool commented 1 year ago

Description

An increased percentage of duplication was observed, as described in https://github.com/Clinical-Genomics/BALSAMIC/issues/1059. Looking carefully at how the percentage of duplicates is reported and removed, it turns out that the method for reporting, marking and removing duplicates is not unified in BALSAMIC: three bioinformatic tools, i.e. fastp, Picard and Sentieon, are used for reporting, marking and removal of duplicates.

The SAM validation error identified in the mergeBam rule led to a comprehensive validation check of the BAM files, which identified invalid flags for mates. It is also relevant to quote the suggestion from the invalid bam flag using picardtools issue: the BAM files generated multiple errors showing an invalid BAM flag for numerous reads. Samtools could fix a few of them, but not all. The Picard FixMateInformation command performs better than samtools and fixes the invalid BAM flags. The likely origin of these errors is the concatenated FASTQ file. The current PR fixes the issue, but it would be better to use each FASTQ as input and merge the BAM files instead.
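For reference, a minimal sketch of the Picard command mentioned above (file names are placeholders; exact options should be checked against the Picard documentation):

```shell
# Repair invalid mate flags in a merged BAM file (placeholder file names)
picard FixMateInformation \
    I=tumor.merged.bam \
    O=tumor.merged.fixmate.bam
```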

It is possible that the problems described in the issues above are linked and require a careful overview to make sure that the standard and recommended workflow is implemented in BALSAMIC for the generation of final and valid BAM files.

Suggested solution

This requires identifying and fixing all the problems associated with validation errors in the final BAM file. This includes

  1. Map the reads per lane and merge the BAM files (https://github.com/Clinical-Genomics/BALSAMIC/pull/1090), removing the concatenation of FASTQ files
  2. Compare fastp, Picard and Sentieon mark-and-remove-duplicates in the TGA and WGS (TN/TO) workflows, and unify the workflow
  3. After unifying the mark-and-remove-duplicates workflow, it will be crucial to identify the 8 possible warnings and 48 possible errors (mentioned in https://github.com/Clinical-Genomics/BALSAMIC/issues/1053) and remove them as part of quality assurance.
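As a sketch of the quality-assurance step in point 3, the final BAM files could be checked with Picard's ValidateSamFile (file name is a placeholder; see the Picard documentation for the exact output format):

```shell
# Summarise all warning and error types found in the final BAM file
picard ValidateSamFile \
    I=tumor.dedup.bam \
    MODE=SUMMARY
```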


This can be closed when:


Blocked by https://github.com/Clinical-Genomics/BALSAMIC/pull/1069 and https://github.com/Clinical-Genomics/BALSAMIC/pull/1090

DAGs: TN_WGS.pdf TN_TGA.pdf


mathiasbio commented 1 year ago

Continued discussion from issue: https://github.com/Clinical-Genomics/BALSAMIC/issues/1053

Implementing an intermediate step in Sentieon Dedup removes all instances of duplicate reads

Looking in more detail at the usage of Sentieon Dedup (https://support.sentieon.com/manual/usages/general/) I observed these arguments:

- `--output_dup_read_name`: when using this option, the output of the command will not be a BAM file with marked/removed duplicates, but a list of read names for reads that were marked as duplicate.
- `--dup_read_name DUPLICATE_READ_NAME_FILE`: when using this input, all reads contained in the DUPLICATE_READ_NAME_FILE will be marked as duplicate, regardless of whether they are primary or non-primary.

These options seemed to be able to solve the missing-mate errors by removing all instances of the duplicate read "regardless of whether they are primary or non-primary". The documentation didn't say anything about unmapped reads, but I decided to give it a try and implemented the intermediate step in the parallel alignment PR (https://github.com/Clinical-Genomics/BALSAMIC/pull/1090).

```shell
# Step 1: collect duplicate score information across the input BAM files
{params.sentieon_exec} driver \
-t {threads} \
-i $shellbamfiles \
--algo LocusCollector \
--fun score_info \
{output.score};

# Step 2: write the names of all reads marked as duplicate to a file
{params.sentieon_exec} driver \
-t {threads} \
-i $shellbamfiles \
--algo Dedup \
--score_info {output.score} \
--metrics {output.metrics} \
--output_dup_read_name {output.dup_names};

# Step 3: remove every instance of the reads listed in the duplicate-name file
{params.sentieon_exec} driver \
-t {threads} \
-i $shellbamfiles \
--algo Dedup \
--rmdup \
--dup_read_name {output.dup_names} \
{output.bam};
```

And it seems to work as intended: running ValidateSamFile on these BAM files finds no errors.

Perhaps this solution can be implemented in the standard pipeline.

If a read is marked as a duplicate, I cannot see any reason why unmapped mates or non-primary alignments of that read should be kept in the bamfile. If this is true, the question remains whether removing these additional reads is important enough to justify adding this extra step, which will add some computation time.

To answer the "important enough" part is difficult, as it can be quite hard to trace the effects of these reads on downstream analysis tools. If I were to guess, the downstream effects of keeping these reads are probably in the range of minor to non-existent; the cleanest option, however, would be to remove them.

I think the most important deciding factor for including it or not is the effect on runtime. If the effect on runtime is in the range of adding ~30 minutes to the total analysis of a WGS T/N case, I think it may be worth it. But this is just my thinking at the moment.

To begin with I will run the same WGS case 3 times each, with and without the extra dedup-step and compare the runtimes.

mathiasbio commented 1 year ago

Adding the extra dedup-step does not significantly increase the runtime of the Dedup rule

Ran the dedup-rule on 2 WGS samples, tumor WGS120X and normal WGS30X, belonging to case uniquewolf. I ran it three times for each version of the dedup-rule, "2 step" and "3 step" dedup.

I summarised the runtime for all steps run in the rule within each category of "elapsed", "user" and "sys".

| total elapsed (m) | total user (m) | total sys (m) | name |
| --- | --- | --- | --- |
| 44,07 | 977,91 | 13,00 | WGS120X_3steps_1 |
| 44,50 | 847,58 | 20,62 | WGS120X_3steps_2 |
| 43,42 | 794,05 | 16,91 | WGS120X_3steps_3 |
| 39,17 | 885,79 | 9,44 | WGS120X_2steps_1 |
| 27,47 | 662,16 | 10,09 | WGS120X_2steps_2 |
| 30,26 | 663,95 | 11,73 | WGS120X_2steps_3 |
| 15,82 | 382,78 | 8,42 | WGS30X_3steps_1 |
| 16,20 | 386,09 | 8,65 | WGS30X_3steps_2 |
| 13,42 | 279,98 | 4,16 | WGS30X_3steps_3 |
| 13,35 | 288,93 | 4,02 | WGS30X_2steps_1 |
| 11,93 | 307,35 | 4,41 | WGS30X_2steps_2 |
| 13,28 | 290,49 | 3,53 | WGS30X_2steps_3 |

Below are the averages for each type:

| elapsed (m) average | user (m) average | sys (m) average | runtype |
| --- | --- | --- | --- |
| 44,00 | 873,18 | 16,84 | WGS120X_3steps |
| 32,30 | 737,30 | 10,42 | WGS120X_2steps |
| 15,15 | 349,62 | 7,07 | WGS30X_3steps |
| 12,85 | 295,59 | 3,99 | WGS30X_2steps |

Below is a table with the differences "3 step - 2 step" for the averages.

| elapsed (m) average diff | user (m) average diff | sys (m) average diff | diff |
| --- | --- | --- | --- |
| 11,70 | 135,8766667 | 6,42 | WGS120X 3 - 2 |
| 2,30 | 54,02722222 | 3,09 | WGS30X 3 - 2 |
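As a sanity check, the WGS120X elapsed-time averages and their difference can be reproduced from the per-run values (transcribed here with decimal points):

```shell
# Average elapsed time for the three WGS120X 3-step runs
awk 'BEGIN { printf "3-step avg: %.2f\n", (44.07 + 44.50 + 43.42) / 3 }'
# Average elapsed time for the three WGS120X 2-step runs
awk 'BEGIN { printf "2-step avg: %.2f\n", (39.17 + 27.47 + 30.26) / 3 }'
# Difference between the two averages
awk 'BEGIN { printf "diff: %.2f\n", (44.07 + 44.50 + 43.42) / 3 - (39.17 + 27.47 + 30.26) / 3 }'
```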

In conclusion, these tests show that adding the extra dedup-step does not significantly affect the runtime and should be fine to include, increasing the average runtime of a 120X WGS sample by only ~12 minutes.

mathiasbio commented 1 year ago

Commenting on the possibility of running fastp as the deduplication stage instead of the Sentieon Dedup or Picard MarkDuplicates tool: I wonder if it is possible, given that we want to keep the workflow parallel upstream of generating the final bamfile.

To explain further, I don't see any option for passing multiple pairs of fastq-files to fastp, which would be necessary to be sure we are removing all duplicates at this stage, given that the same fragment may have duplicate reads across different lanes.

Example, counting read pairs for a single fragment X:

- Lane 1: 2 read pairs for fragment X
- Lane 2: 0 read pairs for fragment X
- Lane 3: 1 read pair for fragment X
- Lane 4: 1 read pair for fragment X

Removing duplicates per lane with fastp would leave 3 read pairs for fragment X, when only 1 should remain.
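A toy illustration of this with coreutils, using unique (lane, fragment) pairs as a stand-in for per-lane dedup and unique fragments as a stand-in for pooled dedup (the lane/fragment names are made up):

```shell
# Toy read list: one line per read pair, "lane fragment"
printf 'lane1 fragX\nlane1 fragX\nlane3 fragX\nlane4 fragX\n' > reads.txt

# Per-lane dedup keeps one pair per (lane, fragment): 3 pairs survive
sort -u reads.txt | wc -l

# Pooled dedup keeps one pair per fragment across all lanes: 1 pair survives
cut -d' ' -f2 reads.txt | sort -u | wc -l
```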

If fastp is limited in this way, then I think we can decide right away that we cannot rely ENTIRELY on this tool. Perhaps it could be implemented later on as a means of speeding up the analysis a bit by removing some of the duplicates before alignment. But I think this would not be a very high priority thing, and it would make collecting the QC values for % duplicates a bit trickier too.

Possible workaround with fastp dedup: there seems to be an option in fastp for splitting the input reads into multiple files, and in this way we COULD make sure that fastp has all available information to successfully dedup across all lanes.

https://github.com/OpenGene/fastp#output-splitting

However, if we do this, I don't think there is a way to preserve the original lane information. The input files will be concatenated like before, and the split will most likely produce fastq-files that are mixed across lanes.

So in the end I'm skeptical that fastp can be our dedup solution!

khurrammaqbool commented 1 year ago

Thank you for the update @mathiasbio. Based on the comments above, I suggest we continue with Sentieon/Picard Markdup. We can also add --dont_eval_duplication to fastp and fetch the duplication information from the BAM file instead. This will improve performance and enable us to provide the same measurement for the number of duplicates in the report and the number of duplicates removed.
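A sketch of the fastp invocation with that flag (file names are placeholders, and the trimming options are elided):

```shell
# Skip fastp's duplication evaluation; duplication metrics will instead
# come from the BAM-level dedup step (placeholder file names)
fastp \
    --in1 sample_R1.fastq.gz --in2 sample_R2.fastq.gz \
    --out1 trimmed_R1.fastq.gz --out2 trimmed_R2.fastq.gz \
    --dont_eval_duplication
```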

mathiasbio commented 1 year ago

Sounds like a good idea! I'll add it!