dantaki / SV2

Support Vector Structural Variation Genotyper
58 stars 11 forks source link

NOALT filter #14

Closed Madelinehazel closed 6 years ago

Madelinehazel commented 6 years ago
dnah6_del

Hello,

I have a ~7 kb high confidence deletion (called by pindel, CNVnator, and LUMPY) in a sample that is not being genotyped correctly by SV2. It is given a filter of 'NOALT', and a reference genotype of 0/0. See attached IGV screenshot.

The line from the SV2 vcf: 2 84888836 . . NA NOALT END=84896512;SVTYPE=DEL;SVLEN=7677;DENOVO_FILTER=NA;REF_GTL=5;AF=0.0; GT:CN:PE:SR:SC:NS:HA:NH:SQ:GL 0/0:2.32:0.07:0.00:0.35:9:nan:0:5:0.69,0.31,0.00

My script: sv2 \ -i $BAMDIR/$BAM \ -v $WORKDIR/${SAMPLE}Out/variants.vcf.gz \ -snv $BAMDIR/${SAMPLE}_haplotypecaller_split_norm_filter.vcf.gz \ -p $WORKDIR/${SAMPLE}.ped \ -O $WORKDIR/SV2 \ -o ${SAMPLE}_80merged \ -merge \ -no-anno

Why would SV2 be unable to genotype this one?

Also, can you explain the filters (PASS, FAIL, NOALT, GENOTYPEFAIL)?

Thank you! Madeline

dantaki commented 6 years ago

Hello Madeline,

The short answer is coverage is your culprit.

Lumpy (and pindel I think) use paired-ends to call SVs. CNVator uses coverage, but applies a sophisticated algorithm to measure deviations in coverage.

SV2 on the other hand uses both paired-ends and coverage to genotype SVs. However, the coverage estimate does not measure deviations from coverage, instead it calculates an adjusted coverage value according to the chromosomal median. SV2 also masks SVs that are in problematic regions: segmental duplications, STRs, unmappable regions, centromeres, and telomeres. So if you provide a locus to genotype that's 10kb and 2kb of the locus is in an unmappable region, only the 8kb region outside the excluded region is considered when calculating coverage.

Let's take a look at the features: 0/0:2.32:0.07:0.00:0.35:9:nan:0:5:0.69,0.31,0.00

The first CN feature is an estimate of copy-number from adjusted coverage. Here it says the region has 2.3 copies. This is the likely reason as to why SV2 genotypes this region as REF. There are some discordant paired-ends (PE) which provide some evidence to the call. Likewise the SNP coverage (SC) has an adjusted overage of 0.35 (0.7 copies), which might indicate the region is deleted, but it's only 9 SNPs. The likelihood scores for 0/0 and 0/1 genotypes are 0.7 and 0.3 respectively, so it's not a confident call.

Looking at the locus in UCSC genome browser hgt_genome_3634_ea9f30

The region is almost entirely in a LINE repeat, which is likely causing the PE signal. Likewise (very common for LINEs), the region is very unmappable. So when SV2 estimated the coverage of the region, it excluded all the gaps in the unmappable track in the image. I didn't apply a RepeatMasker mask, since that would exclude a lot of the genome, so the coverage estimates for this locus included some of the LINE that has mappable tracks. Since there are a lot of LINEs in the genome, more reads are probably aligning in that locus than you would expect which is elevating the coverage.

So is the region deleted? What's the MAPQ of the reads that are aligning there? Your IGV image indicates there is a likely deletion, but if the reads mapping there have less than 10 or 5 MAPQ, it's probably a false positive. If the MAPQ is high, then it's likely real.

In my experience LUMPY will call SVs in many LINEs and we remove those since most intersect the unmappable track by over 66% the length of the SV. As for CNVator, does it predict the region to be deleted with high confidence? I'm not familiar with how it performs in LINEs and other repeats.

Filters: