False negative duplication due to low MAPQ

brentp / smoove

structural variant calling and genotyping with existing tools, but, smoothly.

Apache License 2.0

False negative duplication due to low MAPQ #178

Open lee039 opened 2 years ago

lee039 commented 2 years ago

Hi,

I ran Smoove in population-calling mode. There is a duplication shared by more than 10 samples; however, Smoove did not detect it in any of them, likely because of low mapping quality (the breakpoints are flanked by repeats). This is what the duplication looks like in IGV:

The green reads support the duplication. The depth is also elevated compared to the flanking regions, which further supports a duplication. However, when I only visualize reads with MAPQ > 40:

Then I am left with only one discordant read pair, which is not sufficient for Lumpy to call a duplication. Do you have any advice on how I can force Lumpy to detect this duplication?

Thank you very much in advance!

brentp commented 2 years ago

Hi, smoove uses a mapping quality cutoff of 20. This is not a parameter, but you could modify the value in lumpy/depthfilter.go (MinMapQaulity), recompile, and run.
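
To see how many duplication-supporting pairs survive a given cutoff before recompiling, one option is to count everted read pairs (leftmost read on the reverse strand, rightmost on the forward strand) from SAM text, for example the output of samtools view over the region. This is a minimal sketch and not smoove's or Lumpy's own logic: the function names are hypothetical, a pair is counted if at least one of its reads passes the cutoff, and whether smoove applies the cutoff as >= or > is an assumption here.

```python
from typing import Iterable

# SAM FLAG bits (from the SAM specification)
PAIRED, UNMAPPED, MATE_UNMAPPED = 0x1, 0x4, 0x8
REVERSE, MATE_REVERSE = 0x10, 0x20
SECONDARY, DUPLICATE, SUPPLEMENTARY = 0x100, 0x400, 0x800
SKIP = UNMAPPED | MATE_UNMAPPED | SECONDARY | DUPLICATE | SUPPLEMENTARY


def is_everted(flag: int, rname: str, pos: int, rnext: str, pnext: int) -> bool:
    """Return True if read and mate point away from each other on one contig."""
    if not flag & PAIRED or flag & SKIP:
        return False
    if rnext not in ("=", rname):
        return False
    read_rev = bool(flag & REVERSE)
    mate_rev = bool(flag & MATE_REVERSE)
    if read_rev == mate_rev:
        return False  # same-strand pairs suggest an inversion, not a duplication
    if pos < pnext:
        return read_rev  # this read is leftmost and must be on the reverse strand
    if pos > pnext:
        return mate_rev  # the mate is leftmost and must be on the reverse strand
    return False


def count_everted_pairs(sam_lines: Iterable[str], min_mapq: int = 20) -> int:
    """Count distinct read names with an everted read at MAPQ >= min_mapq."""
    names = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        f = line.rstrip("\n").split("\t")
        qname, flag, rname, pos, mapq = f[0], int(f[1]), f[2], int(f[3]), int(f[4])
        rnext, pnext = f[6], int(f[7])
        if mapq >= min_mapq and is_everted(flag, rname, pos, rnext, pnext):
            names.add(qname)
    return len(names)
```

Running it with several values of min_mapq (for example 0, 20, and 40) on the same region shows how quickly the supporting evidence disappears as the cutoff rises.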

lee039 commented 2 years ago

Hello,

Thanks for the quick answer! When I filter for MAPQ > 20, I see 2 discordant read pairs.

Do you think these are sufficient to call a duplication? In my Smoove output, every call is supported by at least 4 read pairs. Should I also modify the minimum read support somewhere?

brentp commented 2 years ago

Yes, you can lower the support (which defaults to 4) by passing, e.g., --support 2 to smoove call, but note that this will give a lot more false positives.

lee039 commented 2 years ago

Hi again,

I found a sample that has three discordant read pairs with high MAPQ (both forward and reverse reads have MAPQ > 50). So I ran smoove call with -support 1, without changing MinMapQaulity.

However, Smoove still does not detect this duplication. Do you have any other suggestions for this case?

brentp commented 2 years ago

No, sometimes this happens, unfortunately. It might be due to the clustering of the data or the orientation of the reads. Also make sure that any intermediate results that smoove had cached are cleared before you re-run with changed parameters.

lee039 commented 2 years ago

That is a pity, because the orientation of the reads should be interpreted as a duplication. The reads also seem to cluster well, although there is no split-read support (because the breakpoints are flanked by repeats).

I had hoped that Lumpy could detect it, but it does not. Do you know whether there is a way to make a simulated BAM file containing a duplication, so that Lumpy will certainly detect it?

I won't be able to detect the breakpoints accurately anyway. However, it would still be useful to have Duphold output for my samples, since it would at least give a rough approximation of the duplication carriers.
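
The depth signal described here can be approximated by hand as a fold change of the depth inside the candidate region relative to its flanks. The sketch below only illustrates that idea and is not Duphold's implementation: the function name, the use of medians, and the default flank size are assumptions, and the per-base depth list would have to come from a tool such as samtools depth.

```python
from statistics import median
from typing import Sequence


def flank_fold_change(depth: Sequence[int], start: int, end: int, flank: int = 1000) -> float:
    """Median depth in [start, end) divided by median depth in the two flanks.

    `depth` is a per-base depth list with 0-based indexing; `flank` is the
    number of bases taken on each side (hypothetical default).
    """
    inside = list(depth[start:end])
    left = list(depth[max(0, start - flank):start])
    right = list(depth[end:end + flank])
    flanks = left + right
    if not inside or not flanks:
        raise ValueError("region or flanks are empty")
    flank_median = median(flanks)
    if flank_median == 0:
        raise ValueError("flanking depth is zero; fold change is undefined")
    return median(inside) / flank_median
```

For a heterozygous duplication in a diploid sample, three copies against two would give a value near 1.5, while non-carriers should stay near 1.0. Repeats around the breakpoints can distort the flank depth, so the flanks may need to be placed further out.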

brentp commented 2 years ago

You could try running smoove with --no-extra-filters (I can't remember if it's that or --noextrafilters). That is even more lenient. You could also try running lumpy directly.

Otherwise you can simply create a fake VCF containing the variant of interest (doesn't need to be exact) and you can send that to duphold for each sample.
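
A placeholder VCF of that kind can be written with a few lines of Python. This is a minimal sketch under stated assumptions: the function name, record ID, coordinates, and sample name are hypothetical, and the choice of INFO fields (SVTYPE, SVLEN, END) and a single GT sample column is a guess at what Duphold needs, so check it against the Duphold documentation before relying on it.

```python
def make_placeholder_vcf(chrom: str, start: int, end: int, sample: str, svtype: str = "DUP") -> str:
    """Build a one-record VCF describing an approximate structural variant.

    `start` and `end` are 1-based approximate breakpoints; `sample` should
    match the sample name used in the BAM (assumption).
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    header = [
        "##fileformat=VCFv4.2",
        f"##contig=<ID={chrom}>",
        f'##ALT=<ID={svtype},Description="Structural variant of type {svtype}">',
        '##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">',
        '##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Length of the variant">',
        '##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant">',
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", sample]),
    ]
    info = f"SVTYPE={svtype};SVLEN={end - start};END={end}"
    # Genotype is left missing (./.) because nothing has been genotyped yet.
    record = "\t".join(
        [chrom, str(start), "variant_of_interest", "N", f"<{svtype}>", ".", ".", info, "GT", "./."]
    )
    return "\n".join(header + [record]) + "\n"


if __name__ == "__main__":
    # Hypothetical coordinates and sample name, for illustration only.
    with open("placeholder.vcf", "w") as handle:
        handle.write(make_placeholder_vcf("chr1", 1_000_000, 1_050_000, "sample01"))
```

One file per sample (or one file reused with the matching sample name) can then be passed to Duphold together with that sample's BAM.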