AstraZeneca-NGS / VarDictJava

VarDict Java port
MIT License
129 stars 55 forks source link

Amplicon Mode for QIAseq Designs #301

Open crutching opened 4 years ago

crutching commented 4 years ago

Hi! Thanks for the effort on this tool, definitely want it to work for us. We are utilizing the QIAseq primer extension design, meaning there is only a single primer region we are concerned about. I have managed to get this to work, I think, but want to make sure what I am doing is appropriate.

My BED file looks like this:

3   178951824   178952073   .   .   .   178951857   178952073
3   178951841   178952090   .   .   .   178951875   178952090
3   178951923   178952172   .   .   .   178951959   178952172
3   178952008   178952257   .   .   .   178952041   178952257
3   178951873   178952122   .   .   .   178951873   178952094
3   178951970   178952219   .   .   .   178951970   178952191

As you can see, the value in column 7 corresponds to the end position of the primer region. The value in column 8 is exactly the same as the region end position in column 3, since only have a single primer. Both values in column 2 and 7 are 0-based, other two are 1-based. Will this do what I think it is doing?

PolinaBevad commented 4 years ago

Hi @jhl667,

Yes, it must work in the way you want: VarDict will flag variants overlap with amplicon's primer as "bad". We have description here: https://github.com/AstraZeneca-NGS/VarDictJava#amplicon-based-calling

You may also want to look at -a option to tune overlapping parameters if you want!

crutching commented 4 years ago

Thank you very much for the response! I have since found this is not quite going to work as the code is currently written. First problem was null pointer exceptions from having regionEnd == insertEnd. I was able to make the insert coordinates one less, and that problem was solved. For example:

3 178951824 178952073 . . . 178951857 178952072

Next problem is the following code at line 1357 CigarParser.java:

if ((abs(segstart - region.start) > distanceToAmplicon || abs(segend - region.end) > distanceToAmplicon) || abs((ts1 - te1) / (double)(segend - segstart)) <= overlapFraction) {
    return true;
}

This is comparing against both the end and beginning of the region, then filtering out reads when they aren't within distanceToAmplicon of either. For our single-primer extension design, this will filter out a very large proportion of our reads. I was able to test that this was the case by looking at known regions with 2000x depth and getting only 50x calls at the default distance of 10. Increasing this distance value to around 200 picks up almost everything, but then I am not sure if there are problems assigning variants in the case of overlapping amplicons.

So, I can change the code above pretty easily to only filter reads if they are within distanceToAmplicon of the start coordinate, while ignoring the end coordinate, and vice versa. Something like:

if (((abs(segstart - region.start) <= distanceToAmplicon) || ((abs(segend - region.end) <= distanceToAmplicon) {
    return false;
}

The problem is, I'm not sure what the overall impact of this type of change is on this code. For instance, the amplicon output contains extra columns that list "Good" and "Bad" amplicons. How are these determined? Will we still be able to properly assign variant calls to reads originating from a specific single primer?

crutching commented 4 years ago

Opened https://github.com/AstraZeneca-NGS/VarDictJava/pull/303

PolinaBevad commented 4 years ago

Hi @jhl667, Thank you for adding PR. For some reason it was failed on tests, so I'll take a look at it this week!

We determine good/bad variants in AmpliconPostProcessModule: https://github.com/AstraZeneca-NGS/VarDictJava/blob/master/src/main/java/com/astrazeneca/vardict/postprocessmodules/AmpliconPostProcessModule.java#L84. So, this is a check if the variant meets the filter thresholds. We also check amplicons for presence of variants: if a variant covers several amplicons and it was not called in all amplicons, there will be AMPBIAS.

crutching commented 4 years ago

Let me see if I can describe what is confusing me. We have the following variant:

foo .   7   55259515    55259515    T   G   437 20  56  361 4   16  T/G 0.0458  2;2 32.4    1   45.0    1   60.0    40.000  0.0450.0023 0   2.000   1   1.1 20  437 CAAGATCACAGATTTTGGGC    GGCCAAACTGCTGGGTGCGG    7:55259358-55259607 SNV 4   4   0   0   G:20:F-4:R-16:0.0458:2:32.4:1:45.0:1:0.0458:60.0:40.000 & T:417:F-56:R-361:0.9542:2:33.7:1:45.0:1:0.9542:60.0:834.000   
Good0 437 20 56 361 4 16 T/G 0.0458 2;2 32.4 1 45.0 1 60.0 40.000 0.0458 0.0023 7:55259358-55259607 
Good1 437 20 56 361 4 16 T/G 0.0458 2;2 32.4 1 45.0 1 60.0 40.000 0.0458 0.0023 7:55259332-55259582 
Good2 437 20 56 361 4 16 T/G 0.0458 2;2 32.4 1 45.0 1 60.0 40.000 0.0458 0.0023 7:55259353-55259603 
Good3 438 20 56 362 4 16 T/G 0.0457 2;2 32.4 1 45.0 1 60.0 40.000 0.0457 0.0023 7:55259475-55259725

I would expect the listed depths (Good0-Good3) to be quite different from amplicon to amplicon. In fact, the amplicon listed at 7:55259332-55259582 does not even contain the variant if I were to do the following steps:

samtools view -b my_bam.bam 7:55259327-55259337 > test1.bam
samtools index test1.bam
samtools mpileup test1.bam -r 7:55259515-55259515

Output is: 7 55259515 N 9 ttttttttt NNNNNNNNN If I look at another amplicon (7:55259475-55259725) where the expected variant is in a better position, I can see the call clearly with this method:

samtools view -b my_bam.bam 7:55259470-55259480 > test2.bam
samtools index test2.bam
samtools mpileup test2.bam -r 7:55259515-55259515

Output: 7 55259515 N 401 tttttttttgtttttgttttttttttttgttttTTTTTTTTTTTTTGTTTTTTTt$t$tttttttttttttttttttttttttttttttttttttttttttttttttttttttttgttttttttttttttttttgtttttttttttttgttttttttttgtttttttttttttttttttttttttttttttgttttttttttttttttgtttttttttttttttttttttttttttttttttttttttttttttttttttttttttttgttttttgtttttttttttttttttttttttttttttttttgtttttttttttttttttttttttttttgtttttgttttttgttttttttttttttttttttttttttttttttttttttTttttttttgttTt NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNMNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN So, I don't think these per-amplicon depth statistics are being created in the way I am imagining they are the most useful. I would think you would want to look at the region immediately around the start point of the primer, and include enough flanking sequence on each side to satisfy the distance value specified with the -a option. Is this the intended behavior, or is my data not being prepared correctly?

gudeqing commented 4 years ago

hi, @jhl667 I see that you are also dealing with data from Amplicon Mode. I am curious what if we simply remove designed primer from each read and calling variants in a normal way with target bed set? I have the following reason to do so: In a real situation, a primer could target a wrong region and generate unexpected reads. And, with a primer contained, a read is prone to be aligned to the place where the primer comes from. And thus may lead to wrong variant calling especially when we aim to do low-frequency variant calling. However, every coin has two-sides, a primer could also help a target read to be aligned to the right place especially when our read is very short which may cause it being multiple aligned. So, what is your opinion?

Best wishes!