AstraZeneca-NGS / VarDictJava

VarDict Java port
MIT License

Description of any thresholds/limits used in SV/DUP detection? #308

Open tfenne opened 4 years ago

tfenne commented 4 years ago

I'm trying to use VarDictJava 1.7.0 to call a duplication in some hybrid capture data. I'm attempting to establish a lower limit of detection both in the number of DUP-informative reads and the allele fraction.

I have a sample with approximately 5% allele frequency in very deep coverage (> 1000X) and VDJ calls it handily. I've written some code to separate all the ALT-carrying read pairs from the REF-carrying ones (knowing a priori the allele I'm looking for), and then mix back in a set number of the ALT reads to the REF reads. When I do this I get some surprising results. The calling command I'm using looks like this:

    VarDictJava/build/install/VarDict/bin/VarDict \
      -b ${bam} \
      -G ${genome} \
      -f 0.00001 \
      -r 1 \
      -L 1500 \
      -P 0 \
      -F 0 \
      -Q 0 \
      -q 10 \
      -m 50 \
      -o 0 \
      --chimeric \
      -R ${region} \
      | fgrep -v SNV

As you can see, I'm trying to remove as many thresholds as possible. What I see in my tests is that if I have only 1 DUP-supporting read pair (with the breakpoint in the middle of both R1 and R2) I get no call. Same with 2 DUP-supporting read pairs. When I step up to 3 supporting read pairs I get calls some of the time, though which sets of 3 reads generate calls and which don't doesn't follow any pattern I can discern: some sets that appear to my eye to have better support produce no call, while some with much weaker support do.
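The behavior described above would be consistent with a minimum-evidence cutoff applied after per-read quality filtering. As a purely hypothetical sketch (the threshold values, class, and method names here are assumptions for illustration, not VarDictJava's actual code), a check like the following would explain why 1 or 2 supporting pairs never call while 3 sometimes do, with the "sometimes" coming from which pairs survive the upstream read filters:

```java
import java.util.Arrays;

public class DupEvidenceCheck {
    // Assumed minimum number of DUP-supporting read pairs required to
    // emit a call; illustrative only, NOT taken from VarDictJava source.
    static final int MIN_SUPPORTING_PAIRS = 3;
    // Assumed per-read mapping-quality floor applied before counting
    // (analogous in spirit to the -q 10 option above).
    static final int MIN_MAPQ = 10;

    // Count pairs where both mates pass the quality floor, then apply
    // the minimum-evidence cutoff.
    static boolean wouldCall(int[][] pairMapQs) {
        long usable = Arrays.stream(pairMapQs)
                .filter(p -> p[0] >= MIN_MAPQ && p[1] >= MIN_MAPQ)
                .count();
        return usable >= MIN_SUPPORTING_PAIRS;
    }
}
```

Under this model, a set of 3 pairs in which one mate has MAPQ below the floor behaves exactly like a set of 2 pairs, which could account for some 3-pair sets calling and others not.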

My guess (hope?) is that there are probably one or more heuristics and thresholds in the SV calling module that control which reads are examined and what the minimum amount of evidence is to make a call. But I can't find any documentation for this. I would be very grateful if someone with in-depth knowledge of the SV module could comment.

PolinaBevad commented 4 years ago

Hi Tim,

I guess there can be at least 2 reasons:

  1. We don't support all duplication cases at the moment, as you can see here: https://github.com/AstraZeneca-NGS/VarDictJava/blob/03eb5432195673ace20f700c26e1d6ed90736402/src/main/java/com/astrazeneca/vardict/modules/StructuralVariantsProcessor.java#L765 and https://github.com/AstraZeneca-NGS/VarDictJava/blob/03eb5432195673ace20f700c26e1d6ed90736402/src/main/java/com/astrazeneca/vardict/modules/StructuralVariantsProcessor.java#L901
  2. Some DUP SVs supported by discordant pairs can be filtered out by a hard-coded quality threshold. We don't expose a command-line option for it, so please feel free to change it in the code: https://github.com/AstraZeneca-NGS/VarDictJava/blob/b7721796006c92bd2e2759d33840ffbef1972b62/src/main/java/com/astrazeneca/vardict/Configuration.java#L276
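To illustrate the second point, here is a hedged sketch of the kind of hard-coded cutoff being described. The constant's value and all names below are assumptions for illustration; the real constant lives at the Configuration.java link above:

```java
import java.util.Arrays;

public class DiscordantPairFilter {
    // Assumed hard-coded mean-mapping-quality cutoff for discordant
    // pairs; illustrative only, not the actual value in Configuration.java.
    static final double DISCORDANT_QUAL = 35.0;

    // Keep only discordant pairs whose mean mapping quality clears the
    // cutoff; pairs below it contribute no DUP evidence, which can
    // silently shrink the supporting-pair count below the call threshold.
    static long usableDiscordantPairs(double[] meanMapQ) {
        return Arrays.stream(meanMapQ)
                .filter(q -> q >= DISCORDANT_QUAL)
                .count();
    }
}
```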

Also, can you try it with the -p option? This removes the standard variant filtering (https://github.com/AstraZeneca-NGS/VarDictJava#variant-filtering) and produces a kind of "gVCF" result. I typically use it to check whether a variant was filtered out by our hard filters (yes... I remember #62).

If nothing helps, could you share some BAM slices where the DUP wasn't found, so we can improve the algorithm on the SV cases we currently miss?

Thank you!