genome / pindel

Pindel can detect breakpoints of large deletions, medium-sized insertions, inversions, tandem duplications and other structural variants at single-base resolution from next-gen sequence data. It uses a pattern growth approach to identify the breakpoints of these variants from paired-end short reads.
GNU General Public License v3.0

Possible enhancement for VAF estimation #86

Open philip-holmgren opened 6 years ago

philip-holmgren commented 6 years ago

Hi,

I came across this tool because we are developing an in-house NGS pipeline for FLT3-ITD detection in AML patients. We already use NGS for this, but it's outsourced and quite costly, so we're looking into an alternative we can set up ourselves.

Based on an initial validation cohort (17 FLT3-ITD+ AML, 23 FLT3-ITD- AML patients), Pindel does very well: it detects all ITDs, with matching ITD lengths, in all positive samples.

However, we see a marked difference in variant allele fraction (VAF) compared with the other two techniques (fragment analysis and the outsourced NGS analysis). Although fragment analysis might not be very accurate for determining VAF, we were surprised that the other NGS analysis shows a much higher VAF in several FLT3-ITD cases; in some of them the VAF reported by Pindel is about 50% lower. This has also been observed in published comparisons of FLT3-ITD detection tools (e.g. Rustagi et al., https://www.ncbi.nlm.nih.gov/pubmed/27121965).

To understand what causes the difference, we used a software package (SeqPilot, JSI) to detect the FLT3-ITD in one of our positive patients. Although SeqPilot is not really good at detecting FLT3-ITD, in this case the 21 bp indel was probably small enough to be mapped and called correctly. Similar to the outsourced analysis, SeqPilot showed a much higher VAF than Pindel for this variant.

To understand the difference we extracted the variant reads in SeqPilot and compared them with the reads in the Pindel output.

Pindel called the 21 bp variant with a VAF of 30% (ADREF=5122; ADALT=2227), whereas SeqPilot called the same variant with a VAF of 40% (ADREF=6411; ADALT=4274). We are not entirely sure where the difference in total coverage comes from (perhaps SeqPilot's alignment is less strict than ours with BWA-MEM), but we focused on the coverage of the ITD allele to explain the ~2000-read difference in ADALT.
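For reference, the percentages above follow from the allele depths as ALT / (REF + ALT). A minimal Python sketch of that arithmetic (the function name `vaf` is just for illustration and is not part of Pindel or SeqPilot):

```python
def vaf(ad_ref: int, ad_alt: int) -> float:
    """Variant allele fraction = ALT depth / (REF depth + ALT depth)."""
    total = ad_ref + ad_alt
    if total == 0:
        raise ValueError("no coverage at this site")
    return ad_alt / total


# Allele depths as reported above for the 21 bp variant.
pindel_vaf = vaf(5122, 2227)
seqpilot_vaf = vaf(6411, 4274)

print(f"Pindel:   {pindel_vaf:.1%}")    # 30.3%
print(f"SeqPilot: {seqpilot_vaf:.1%}")  # 40.0%
```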

By comparing read header information we got the following results:

Read, unique to Pindel, both, unique to SeqPilot
R1, 22, 897, 1308
R2, 76, 1232, 837
Same header, 76, 897, 9

If we compare the read 1 intersect or the read 2 intersect individually, we notice that SeqPilot has many more reads calling the variant, in R1 as well as in R2. However, if we simply look at the originating sequence fragment (same FASTQ header, disregarding R1/R2), we notice that almost all fragments are used to detect the variant by both tools. The major difference is that SeqPilot often detects the variant in both reads of a pair, whereas Pindel only calls it in either one but not both.
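The comparison itself is plain set arithmetic on read names. Below is a rough Python sketch of the idea, assuming read IDs carry a `/1` or `/2` mate suffix (real headers may encode the mate differently, e.g. in a BAM flag); the helper names and the toy IDs are made up for illustration and are not our actual data:

```python
def fragment_of(read_id: str) -> str:
    """Strip a trailing '/1' or '/2' mate suffix (assumed naming scheme)."""
    base, sep, mate = read_id.rpartition("/")
    return base if sep and mate in ("1", "2") else read_id


def overlap(a_ids, b_ids) -> dict:
    """Count IDs unique to A, shared by both, and unique to B."""
    a, b = set(a_ids), set(b_ids)
    return {"only_a": len(a - b), "both": len(a & b), "only_b": len(b - a)}


# Toy data, not taken from the sample discussed above.
pindel_reads = ["f1/1", "f2/2", "f3/1"]
seqpilot_reads = ["f1/1", "f1/2", "f2/1", "f2/2", "f4/1"]

# Per-read comparison: R1 and R2 of a fragment count separately.
per_read = overlap(pindel_reads, seqpilot_reads)

# Per-fragment comparison: mates collapse onto the same fragment name.
per_fragment = overlap(
    map(fragment_of, pindel_reads), map(fragment_of, seqpilot_reads)
)

print(per_read)      # {'only_a': 1, 'both': 2, 'only_b': 3}
print(per_fragment)  # {'only_a': 1, 'both': 2, 'only_b': 1}
```

As a sanity check on the reported counts: the R1 and R2 rows add up to the ADALT values, 22 + 897 + 76 + 1232 = 2227 for Pindel and 897 + 1308 + 1232 + 837 = 4274 for SeqPilot.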

Given how Pindel works, this would make sense: it uses one read of a pair as the mapped read to get the anchor point, then tries to map the unmapped read and find the proper breakpoints. However, in a lot of cases the mapped read (from Pindel's perspective) is only partially mapped and soft-clipped toward the end (at least for the reads I investigated), so it may carry the variant as well.

Based on these findings we propose that VAF could be estimated more accurately if Pindel checked for the presence of the variant in the mapped read.
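To make the idea a bit more concrete, here is a small Python sketch of how a soft-clipped anchor read could be flagged from its CIGAR string. This is only an illustration of the kind of check we mean, not how Pindel is implemented; the function names and the `min_clip` threshold are hypothetical, and a real check would also need to confirm that the clipped bases actually match the ITD.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM spec).
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clips(cigar: str) -> tuple[int, int]:
    """Return (leading, trailing) soft-clip lengths of a CIGAR string."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Hard clips (H) may sit outside the soft clips; ignore them.
    while ops and ops[0][1] == "H":
        ops.pop(0)
    while ops and ops[-1][1] == "H":
        ops.pop()
    lead = ops[0][0] if ops and ops[0][1] == "S" else 0
    trail = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return lead, trail


def anchor_may_carry_variant(cigar: str, min_clip: int = 10) -> bool:
    """Flag anchors with a soft clip of at least min_clip bases.

    min_clip is an arbitrary illustrative threshold, not a Pindel setting.
    """
    return max(soft_clips(cigar)) >= min_clip


print(soft_clips("80M21S"))                # (0, 21)
print(anchor_may_carry_variant("80M21S"))  # True
print(anchor_may_carry_variant("101M"))    # False
```

Fragments whose anchor read is flagged this way would be the ones to re-examine when counting variant-supporting reads.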

I could provide the sample data if that would help.

Kind regards