Illumina / Pisces

Somatic and germline variant caller for amplicon data. Recommended caller for tumor-only workflows.
GNU General Public License v3.0

RMxN filter #39

Closed saty89 closed 5 years ago

saty89 commented 5 years ago

Hi, I am trying to understand how this filter is being applied. According to the docs - RMxN Filter: This filter filters indels that are in sections of the genome with repeats of length [1 to M], repeated >= N times. By default, M=5, and N=9.

Below is an example from one sample:

10 104854004 . T TA 100 PASS DP=385 GT:GQ:AD:DP:VF:NL:SB 0/1:100:248,137:385:0.356:20:-98.0037
10 104854004 . T TAA 22 q30;R5x9 DP=385 GT:GQ:AD:DP:VF:NL:SB 0/1:22:375,10:385:0.026:20:-6.8438
10 104854004 . TA T 27 q30;R5x9 DP=382 GT:GQ:AD:DP:VF:NL:SB 0/1:27:371,11:382:0.029:20:-6.5294

The reference genome at this location and the next few bases is 'TAAAAAAAAAAAATT', which, if I understood correctly, means M=1 (A) and N=12.
If so, even the first line (T>TA) should have been flagged as R5x9.
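
The repeat reading described here can be sketched in Python. This is only an illustration of the documented rule (unit length 1 to M, repeated at least N times), not Pisces's actual implementation; the function name `max_repeat_run` is hypothetical, and the reference context is written as 12 A's between the flanking T's, as stated above.

```python
def max_repeat_run(sequence: str, max_unit_len: int = 5):
    """Return (unit, count) for the longest tandem run of any unit of length 1..max_unit_len."""
    best_unit, best_count = "", 0
    for unit_len in range(1, max_unit_len + 1):
        for start in range(len(sequence) - unit_len + 1):
            unit = sequence[start:start + unit_len]
            count = 1
            pos = start + unit_len
            # Extend the run while the next chunk repeats the same unit.
            while sequence[pos:pos + unit_len] == unit:
                count += 1
                pos += unit_len
            # Strict comparison keeps the shortest unit when counts tie.
            if count > best_count:
                best_unit, best_count = unit, count
    return best_unit, best_count


# Reference context from the question, assuming 12 A's as stated there.
REFERENCE_CONTEXT = "T" + "A" * 12 + "TT"

unit, count = max_repeat_run(REFERENCE_CONTEXT, max_unit_len=5)  # M = 5
in_repeat_context = count >= 9                                    # N = 9
print(unit, count, in_repeat_context)
```

Under these assumptions the site does sit in an R5x9 repeat context (unit A, 12 copies), so repeat context alone does not explain why T>TA passed.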

Please correct me if I got this wrong and thanks in advance for your help!

saty89 commented 5 years ago

Oh, on the same issue, I noticed that the same REF:ALT combination which was PASS in the above example has now been marked with the "R5x9" filter in a different sample:

10 104854004 . T TA 100 R5x9 DP=407 GT:GQ:AD:DP:VF:NL:SB 0/1:100:281,126:407:0.310:20:-100.0000
10 104854004 . T TAA 72 R5x9 DP=407 GT:GQ:AD:DP:VF:NL:SB 0/1:72:388,19:407:0.047:20:-21.6137
10 104854004 . TA T 21 q30;R5x9 DP=405 GT:GQ:AD:DP:VF:NL:SB 0/1:21:395,10:405:0.025:20:-12.2147

So I am wondering: when applying this particular filter, does the caller consider additional metrics or filters?

Thanks!

tamsen commented 5 years ago

Hi Satwica,

Thanks for asking.

So there is a frequency component to the RMxN filter, too. Typical thresholds we have used are 20% and 35% (35% being the current favorite across our most recent data sets). So the filter only kicks in for variants that have a frequency less than the threshold AND are in a repeat context of the reference genome AND have the specific local repeat element either appended to or deleted from the called allele.
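
The three conditions can be sketched as a small Python predicate. This is a minimal illustration, not the Pisces source: the names `Call`, `indel_bases`, and `rmxn_flag` are hypothetical, the 35% threshold is assumed (the actual value depends on the settings used), and treating "the local repeat element" as one or more whole copies of the repeat unit is an assumption made so that T>TAA is covered.

```python
from dataclasses import dataclass


@dataclass
class Call:
    ref: str
    alt: str
    vf: float  # variant frequency, the VF field in the VCF sample column


def indel_bases(call: Call) -> str:
    """Inserted or deleted bases of a simple indel, or "" if it is not one."""
    if len(call.alt) > len(call.ref) and call.alt.startswith(call.ref):
        return call.alt[len(call.ref):]
    if len(call.ref) > len(call.alt) and call.ref.startswith(call.alt):
        return call.ref[len(call.alt):]
    return ""


def rmxn_flag(call: Call, repeat_unit: str, repeat_count: int,
              min_repeats: int = 9, vf_threshold: float = 0.35) -> bool:
    """True when all three conditions hold (assumed reading of the filter)."""
    bases = indel_bases(call)
    # Assumption: the indel must consist of whole copies of the repeat unit.
    copies = len(bases) // len(repeat_unit)
    matches_unit = bool(bases) and bases == repeat_unit * copies
    return (call.vf < vf_threshold
            and repeat_count >= min_repeats
            and matches_unit)


# VF values taken from the two samples in this thread; context is A x 12.
sample_1 = [Call("T", "TA", 0.356), Call("T", "TAA", 0.026), Call("TA", "T", 0.029)]
sample_2 = [Call("T", "TA", 0.310), Call("T", "TAA", 0.047), Call("TA", "T", 0.025)]

flags_1 = [rmxn_flag(c, "A", 12) for c in sample_1]
flags_2 = [rmxn_flag(c, "A", 12) for c in sample_2]
print(flags_1, flags_2)
```

With a 35% threshold, only the T>TA call at VF 0.356 escapes the flag, which matches the R5x9 annotations reported for both samples.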

The exact parameters used (which depend on the version you have and the command line you used) will be stored in the options-used XML in your logs folder.

There is a longer discussion of this filter in the supplemental section of the Pisces paper: https://academic.oup.com/bioinformatics/article/35/9/1579/5124278

Thanks again for your interest, Tamsen

saty89 commented 5 years ago

Hi Tamsen, ahh, thanks for reminding me about the additional frequency component. It completely slipped my mind. And yes, I think 35% is what was set, so now it makes sense why the one variant passed and the others did not. Also, thanks for pointing me to the documentation; I will have a read.

Thanks for all your help! Satwica