Clinical-Genomics / BALSAMIC

Bioinformatic Analysis pipeLine for SomAtic Mutations In Cancer
https://balsamic.readthedocs.io/
MIT License

[Assessment] Extra variant calls by Manta after new release v13.0.0 #1364

Closed mathiasbio closed 8 months ago

mathiasbio commented 9 months ago

Description

After the changes to the core pipeline in v13.0.0, a substantial increase in the number of Manta variants has been observed.

More details can be found in the "Number of variants" section of the BALSAMIC v13.0.0 validation report: https://github.com/Clinical-Genomics/validations/blob/main/docs/balsamic/balsamic13.0_validation_report.md

In summary, the number of SVs for two cases increased significantly after this update:

- civilsole: from 8900 to 10200
- setamoeba: from 753 to 1460

This has also been observed in production cases, to an even larger extent. Increasing the priority of this assessment.

Suggested approach

Run the same sample using different bamfiles, in a step-wise process adding and removing changes between v12 and v13, to isolate which update causes the change. Then identify a way to proceed.

Conclusion (current)

The corrected trimming order in v13 (first trimming UMIs, then quality-trimming) caused previously soft-clipped bases to be hard-trimmed instead. These bases were then probably no longer ignored by Manta, leading to the extra variant calls.

To fix this we should likely implement some additional filters!

Criteria

The number of SV calls from Manta

Origin

No response

Anything else?

No response

mathiasbio commented 8 months ago

This issue has popped up in production as well. For instance case X (see CG doc)

This is a TGA case with a total of 15755 clinical SVs, the majority of which come from Manta. Checking the original Manta results in the vcf-dir (SV.somatic.X.manta.vcf.gz):

There were 15488 Manta variants after the initial calls.

I have re-run the case with the BALSAMIC v12.0.2 version and instead got 3771 calls, which is still substantial but significantly different from the v13.0.0 calls.

There are a few steps that differ in the generation of the final bamfiles between the two versions, specifically some post-processing steps that have been removed and that I have tested re-applying to the v13.0.0 bamfile before restarting Manta.

On the other hand, re-running the v12.0.2-generated bamfile with this manual Manta script also resulted in 3771 calls. In conclusion, at the moment it seems that this has nothing to do with the missing post-processing steps, but rather with some core changes in the alignment itself...

mathiasbio commented 8 months ago

Looking at the VCF output from Manta in a bit more detail, it seems that nearly all variants have a PR:SR of 0,0:0,0, which is very strange. This is true of the v12.0.2 version as well.

##FORMAT=<ID=PR,Number=.,Type=Integer,Description="Spanning paired-read support for the ref and alt alleles in the order listed">

##FORMAT=<ID=SR,Number=.,Type=Integer,Description="Split reads for the ref and alt alleles in the order listed, for reads where P(allele|read)>0.999">

With only 300 variants in v13.0.0 with any PR and SR support, and 698 in v12.0.2.

This stat is akin to read-depth in SNV and InDel callers, and I found it very confusing why these variants even exist. Looking through the Illumina GitHub issues I found this bug report: https://github.com/Illumina/manta/issues/42

There the Illumina devs explained that the issue with these variants is that the analysis had been run on enrichment data without the --exome option, which is the case for us as well. According to their documentation, this option is recommended for all targeted approaches:

https://github.com/Illumina/manta/tree/master/docs/userGuide#exometargeted

> Supplying the --exome flag at configuration time will provide appropriate settings for WES and other regional enrichment analyses. At present this flag disables all high depth filters, which are designed to exclude pericentromeric reference compressions in the WGS case but cannot be applied correctly to a targeted analysis.

It's unclear, however, whether this will affect the number of variants called, or whether it will only produce accurate stats for the variants, adding the PR:SR info to these MaxDepth variants.

fevac commented 8 months ago

Thanks @mathiasbio for this thorough investigation. So if I understand correctly, there are 2 things happening:

  1. The alignment has changed and that somehow drives up the number of variants
    • Question: have you run the variant detection rules from version v13 using the alignment from v12? I would assume you also get a high number of variants
    • What are the differences in the alignments apart from the added unmapped reads?
    • If we were to subset the alignment from v13 to include only the reads present in the alignment from v12, or exclude the unmapped reads, would the number of detected variants go down? If so, we could assume that it's not about the alignment itself but possibly due to the addition of unmapped reads
  2. We are using the wrong settings for exome and TGA with Manta
    • What happens to the number of variants when we add the recommended flag? This is possibly the most interesting question to check, as if the number is restored we also have a way to move forward

mathiasbio commented 8 months ago

  1. Yes, it seems that when I run Manta on the old bamfile, via the old workflow OR via the little manual Manta script I made, I get the same "low" number of variants, that is 3771. And regardless of what I do to make the new bamfile resemble the old one, by re-running each of the post-processing steps we removed in v13.0.0, the number of calls doesn't change. So it seems to have to do with the basic alignment somehow! Or with Dedup, or some tiny changes in the fastp trimming.

Regarding your questions:

  1. The runs with the --exome flag have finished now, and the numbers are the same! But we now have correct info about the read support of the variants; here's a snippet of the PR:SR values:
850,1:3137,2
526,1:1687,2
202,1:655,2
468,1:1766,2
459,1:1623,2
600,1:2188,2
670,1:2167,2
484,1:1444,2
499,1:1459,2
606,1:1438,2
517,0:2152,4
564,1:1945,2
523,1:1661,2
576,1:2078,2
339,1:914,2

Where the first element in each pair is the REF support and the second is the ALT support. This shows that quite low VAFs, below 1%, are allowed through. And since we're calling variants with such small support, I suppose it's not that surprising that small changes in the alignment could have a large effect. Maybe there is some small difference in the default settings of Sentieon bwa mem and regular bwa mem which results in more strange reads being aligned 🤔
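As a quick sanity check, the VAF implied by one of these PR:SR pairs can be computed like this (a minimal sketch, not BALSAMIC code; the helper name is made up):

```python
# Minimal sketch (not BALSAMIC code): derive a rough VAF from a Manta
# PR:SR string, where each field is "refReads,altReads".

def manta_vaf(pr_sr: str) -> float:
    """Return the ALT-supporting read fraction for e.g. '850,1:3137,2'."""
    pr, sr = pr_sr.split(":")
    pr_ref, pr_alt = (int(n) for n in pr.split(","))
    sr_ref, sr_alt = (int(n) for n in sr.split(","))
    alt = pr_alt + sr_alt
    total = pr_ref + pr_alt + sr_ref + sr_alt
    return alt / total if total else 0.0

# First entry of the snippet above: 3 ALT reads out of 3990 total, i.e. well below 1% VAF
print(f"{manta_vaf('850,1:3137,2'):.5f}")
```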

mathiasbio commented 8 months ago

I checked the flagstats for the case, taken immediately after MarkDuplicates for v12 and immediately after Dedup for v13:

Flagstats v13:

47729564 + 0 in total (QC-passed reads + QC-failed reads)
47282300 + 0 primary
447264 + 0 secondary
0 + 0 supplementary
8401052 + 0 duplicates
8401052 + 0 primary duplicates
47727777 + 0 mapped (100.00% : N/A)
47280513 + 0 primary mapped (100.00% : N/A)
47282300 + 0 paired in sequencing
23641150 + 0 read1
23641150 + 0 read2
46611536 + 0 properly paired (98.58% : N/A)
47279574 + 0 with itself and mate mapped
939 + 0 singletons (0.00% : N/A)
517482 + 0 with mate mapped to a different chr
456463 + 0 with mate mapped to a different chr (mapQ>=5)

Flagstats v12:

47780300 + 0 in total (QC-passed reads + QC-failed reads)
47312824 + 0 primary
467476 + 0 secondary
0 + 0 supplementary
8404524 + 0 duplicates
8404524 + 0 primary duplicates
47778484 + 0 mapped (100.00% : N/A)
47311008 + 0 primary mapped (100.00% : N/A)
47312824 + 0 paired in sequencing
23656412 + 0 read1
23656412 + 0 read2
46633346 + 0 properly paired (98.56% : N/A)
47309998 + 0 with itself and mate mapped
1010 + 0 singletons (0.00% : N/A)
524092 + 0 with mate mapped to a different chr
461570 + 0 with mate mapped to a different chr (mapQ>=5)

There's not a very great difference.
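To put numbers on "not a very great difference", the two flagstat outputs can be diffed field by field (a hypothetical sketch; `parse_flagstat` is not part of the pipeline):

```python
# Hypothetical sketch: parse samtools-flagstat lines into {label: count}
# so the v12 and v13 numbers can be compared field by field.

def parse_flagstat(text: str) -> dict:
    stats = {}
    for line in text.strip().splitlines():
        count, rest = line.split(" + ", 1)
        # drop the QC-failed count and any trailing "(...)" percentage
        label = rest.split(" ", 1)[1].split(" (")[0]
        stats[label] = int(count)
    return stats

v13 = parse_flagstat("47282300 + 0 primary\n46611536 + 0 properly paired (98.58% : N/A)")
v12 = parse_flagstat("47312824 + 0 primary\n46633346 + 0 properly paired (98.56% : N/A)")
# v12 has ~30k more primary reads, a difference well under 0.1%
for label in v13:
    print(label, v12[label] - v13[label])
```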

mathiasbio commented 8 months ago

Comparing the old bwa mem:


bwa mem \
-t {threads} \
-R {params.bam_header}  \
-M \
-v 1 \
{input.fa} {input.read1} {input.read2} \
| samtools sort -T {params.tmpdir} \
--threads {threads} \
--output-fmt BAM \
-o {output.bamout} - ;

With the new:


{params.sentieon_exec} bwa mem -M \
-R '@RG\\tID:{wildcards.fastq_pattern}\\tSM:{params.sample_type}\\tPL:ILLUMINA' \
-t {threads} \
-K 50000000 \
{input.ref} {input.fastq_r1} {input.fastq_r2} \
| {params.sentieon_exec} util sort \
-o {output.bam_out} \
-t {threads} \
--block_size 3G \
--sam2bam -i -;

I don't see any obvious difference in terms of input parameters, and after comparing the default values set in the "--help" messages for both bwa mem and Sentieon bwa mem, they look identical, as far as I can tell.

mathiasbio commented 8 months ago

One possibility that I can see, if there isn't something else hidden in the code of these different (but very similar) tools, is the -K value, which we added to the Sentieon bwa mem for reproducibility, since running bwa mem without it seems to produce slightly different results when using different thread counts.

- "Bwa Mem Have Different Alignment Result When Using Different Threads": https://www.biostars.org/p/90390/
- "Same fastqs and pipeline get different variant results": https://www.biostars.org/p/427618/

In v12.0.2 we ran bwa mem without this option and with 16 threads; now we're running Sentieon bwa mem with 12 threads. It could be that this is what's creating the difference... but it's still odd to me that the effect is consistently more reads that can be used as evidence for SVs, which doesn't really fit with the picture of this being random.

mathiasbio commented 8 months ago

It could be something to do with the trimming. I have checked a few of the reads that supported variants called uniquely in V13, and uniquely in V12 (these exist too), and these reads consistently had somewhat different read lengths: in V13 they were more often shorter, while in V12 they were longer, with the extra bases soft-clipped instead.

From V13:

Read name = LH00202:47:222GJCLT4:7:1394:9189:5454:UMI_ATACT_AGGTT
Sample = TUMOR
Read group = 7_171015_222GJCLT4_ACC13676A3_XXXXXX_R
Read length = 142bp
Flags = 145
----------------------
Mapping = Primary @ MAPQ 60
Reference span = chr7:5 551 575-5 551 680 (-) = 106bp
Cigar = 36S106M
Clipping = Left 36 soft
----------------------
Mate is mapped = yes
Mate start = chr5:40426341 (+)
Second in pair
----------------------
SupplementaryAlignments
5:40 426 397-40 426 432 (-) = 35bp @MAPQ 59 NM1
----------------------
MC = 90M55S
NM = 0
AS = 106
XS = 80
Hidden tags: SA, MD, RG
Location = chr7:5 551 579
Base = A @ QV 40

From V12:


Read name = LH00202:47:222GJCLT4:7:1394:9189:5454
Sample = TUMOR
Library = ILLUMINAi
Read group = TUMOR
Read length = 147bp
Flags = 145
----------------------
Mapping = Primary @ MAPQ 60
Reference span = chr7:5 551 575-5 551 681 (-) = 107bp
Cigar = 36S107M4S
Clipping = Left 36 soft; Right 4 soft
----------------------
Mate is mapped = yes
Mate start = chr5:40426340 (+)
Second in pair
----------------------
SupplementaryAlignments
5:40 426 397-40 426 432 (-) = 35bp @MAPQ 59 NM1
----------------------
MC = 4S91M55S
PG = MarkDuplicates
NM = 0
MQ = 60
AS = 107
XS = 80
Hidden tags: SA, MD, RG
Location = chr7:5 551 595
Base = A @ QV 40

mathiasbio commented 8 months ago

To test this I will produce the alignment with the V13 settings but using the trimmed output from V12, and then run Manta again on that bamfile.

But I think at the moment the biggest issue is that we don't have any real filters on the Manta PR:SR values, and based on some testing we can greatly reduce the number of variants by applying even quite relaxed filters, such as requiring a minimum of 1% VAF.

This reduces the number of variants in this case from 15488 to 850.

It can be further reduced to 100 variants by additionally requiring that at least 5 reads support the variant. I have already implemented this filter in this PR: https://github.com/Clinical-Genomics/BALSAMIC/pull/1371/files

Along with some additional changes.
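The filter logic described above could be sketched like this (illustrative only; the thresholds mirror the numbers discussed, and the function name is made up rather than taken from the PR):

```python
# Illustrative sketch of a PR:SR-based SV filter: keep a call only if it
# has >= 5 ALT-supporting reads and a VAF of at least 1%.

def keep_sv_call(pr_sr: str, min_alt_reads: int = 5, min_vaf: float = 0.01) -> bool:
    pr, sr = pr_sr.split(":")
    pr_ref, pr_alt = (int(n) for n in pr.split(","))
    sr_ref, sr_alt = (int(n) for n in sr.split(","))
    alt = pr_alt + sr_alt
    total = pr_ref + pr_alt + sr_ref + sr_alt
    return alt >= min_alt_reads and total > 0 and alt / total >= min_vaf

print(keep_sv_call("850,1:3137,2"))  # False: only 3 ALT reads, VAF far below 1%
print(keep_sv_call("40,10:60,5"))    # True: 15 ALT reads, VAF ~13%
```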

mathiasbio commented 8 months ago

Regarding the trimming changes between v12 and v13:

v12:


fastp --thread 8 --in1 concatenatedR1.fastq.gz --in2 concatenatedR2.fastq.gz --out1 concatenatedR1.umi_optimized.fastq.gz --out2 concatenatedR2.umi_optimized.fastq.gz --json fastp_umi.json --html fastp_umi.html --overrepresentation_analysis --trim_tail1 1 --n_base_limit 50 --length_required 25 --low_complexity_filter --trim_poly_g --detect_adapter_for_pe

fastp --thread 8 --in1 concatenatedR1.umi_optimized.fastq.gz --in2 concatenatedR2.umi_optimized.fastq.gz --out1 concatenatedR1.fp.fastq.gz --out2 concatenatedR2.fp.fastq.gz --json fastp.json --html fastp.html --disable_adapter_trimming --disable_quality_filtering --disable_length_filtering --disable_trim_poly_g --length_required 25 --umi --umi_loc per_read --umi_len 5 --umi_prefix UMI

v13:


fastp --thread 4 --in1 [LANE]_R_1.fastq.gz --in2 [LANE]_R_2.fastq.gz --out1 [LANE]_R_1.umi_removed.fastq.gz --out2 [LANE]_R_2.umi_removed.fastq.gz --json umi_removed_fastp.json --html umi_removed_fastp.html --disable_adapter_trimming --disable_quality_filtering --disable_length_filtering --disable_trim_poly_g --umi --umi_loc per_read --umi_len 5 --umi_prefix UMI --dont_eval_duplication

fastp --thread 4 --in1 [LANE]_R_1.umi_removed.fastq.gz  --in2 [LANE]_R_2.umi_removed.fastq.gz  --out1 [LANE]_R_1.fp.fastq.gz --out2 [LANE]_R_2.fp.fastq.gz --json fastp.json --html fastp.html --overrepresentation_analysis --trim_tail1 1 --n_base_limit 50 --length_required 25 --low_complexity_filter --trim_poly_g --dont_eval_duplication --detect_adapter_for_pe

Differences:

  1. --dont_eval_duplication has been added. From the fastp help: "don't evaluate duplication rate to save time and use less memory." We don't need this stat since we're already getting it from Dedup.

  2. The order of trimming is switched around so that UMIs are trimmed before quality-trimming.

  3. Reads are trimmed per lane, instead of concatenated and then trimmed. But that shouldn't matter, should it?

That's all I can see.

mathiasbio commented 8 months ago

I have now tried several more things:

What finally made a difference has been:

It now seems to me that the change we have seen comes from the parallel alignment... I'm going to make a quick table to summarise all of this.

mathiasbio commented 8 months ago

Created a google sheet in the CG drive: https://docs.google.com/spreadsheets/d/1vWElplkf3BjKxtHBzfHY5XgZ1bkBJveIph6V6TjazlY/edit#gid=664476823

And it turns out that I made a mistake when doing the V12 trimming: actually trimming with the V12 settings leads to wildly different results, of around 170 variants.

It seems that due to the odd order of trimming in V12, quality- and adapter-trimming before UMI removal, some reads weren't sufficiently hard-clipped, which led to them being soft-clipped during alignment.

And as can be read in this issue: https://github.com/Illumina/manta/issues/104, quoted from a Manta developer:

> For fragment size estimation, Manta looks for high-quality read pairs using fairly conservative criteria. This is important to get an accurate fragment size distribution. As part of its high-quality criteria, soft clipping is expected and allowed on the “inside” of a paired-read fragment (ie. fwd-strand 50M10S and rev-strand 10S50M are considered okay), soft-clipping on the “outside” of the fragment is not typically expected, so these reads are skipped over. In the above examples, I’m not sure if there’s some type of adapter masking or other pattern causing small portions of the outside of each fragment to be soft-clipped in most cases, but this clipping may be the primary reason this step is failing. If this is adapter sequence clipping, it would be better to remove it from the input sequence entirely instead of soft-clipping. Besides the above issue with fragment size estimation, Manta is going to unroll and attempt to realign all soft-clipped regions in the context of each SV hypothesis, so soft-clipped adapters can cause trouble -- the method knows how to look and account for adapter sequence resulting from a complete read-through of a short fragment, but not something clipped off of the first several cycles.

Manta is sensitive to soft-clipped ends appearing in the expected locations of the reads. Comparing the variants called in V13 to V12, I saw that reads used as support for variant calling in V13 had additional soft-clipped ends in V12 which in V13 had been hard-clipped by fastp. Likely, then, the current V13 behavior is the correct one: more reads are used as evidence for variant calling, resulting in more calls, whereas in V12 some reads were likely ignored due to the presence of these unexpected soft-clipped ends.

In conclusion, I think we should preserve the way we're doing trimming in V13, and continue to fix the issue of the extra Manta variants by implementing filters!

Thanks a lot @fevac for helping me clear this up! 🌟

fevac commented 8 months ago

We quantified the amount of soft-clipping on the “outside” of the fragment (that is, alignments starting or ending with soft-clipped bases) for v12 and v13:
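A check along those lines could look like this (a sketch, not the actual script used; following the Manta developer's description, "outside" here means a leading soft-clip on forward-strand reads or a trailing soft-clip on reverse-strand reads):

```python
# Sketch: flag reads soft-clipped on the "outside" of the fragment,
# i.e. at the alignment start for forward-strand reads and at the
# alignment end for reverse-strand reads.
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def outside_soft_clipped(cigar: str, is_reverse: bool) -> bool:
    ops = CIGAR_OP.findall(cigar)
    if not ops:
        return False
    # the trailing op is the fragment "outside" for reverse reads, the leading op for forward reads
    _, op = ops[-1] if is_reverse else ops[0]
    return op == "S"

# The v12 read shown earlier (reverse strand, CIGAR 36S107M4S) carries a
# trailing 4S clip on the outside; the v13 read (36S106M) does not.
print(outside_soft_clipped("36S107M4S", is_reverse=True))  # True
print(outside_soft_clipped("36S106M", is_reverse=True))    # False
```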

fevac commented 8 months ago

Also, maybe worth mentioning here: this difference in the number of variants should only occur in cases with UMI sequences.

mathiasbio commented 8 months ago

As a user-story has been created with an implementation to solve the issue investigated in this assessment, I'm closing this.