Closed Madelinehazel closed 6 years ago
Hello Madeline,
The short answer is coverage is your culprit.
Lumpy (and pindel I think) use paired-ends to call SVs. CNVator uses coverage, but applies a sophisticated algorithm to measure deviations in coverage.
SV2 on the other hand uses both paired-ends and coverage to genotype SVs. However, the coverage estimate does not measure deviations from coverage, instead it calculates an adjusted coverage value according to the chromosomal median. SV2 also masks SVs that are in problematic regions: segmental duplications, STRs, unmappable regions, centromeres, and telomeres. So if you provide a locus to genotype that's 10kb and 2kb of the locus is in an unmappable region, only the 8kb region outside the excluded region is considered when calculating coverage.
Let's take a look at the features: 0/0:2.32:0.07:0.00:0.35:9:nan:0:5:0.69,0.31,0.00
The first CN
feature is an estimate of copy-number from adjusted coverage. Here it says the region has 2.3 copies. This is the likely reason as to why SV2 genotypes this region as REF. There are some discordant paired-ends (PE
) which provide some evidence to the call. Likewise the SNP coverage (SC
) has an adjusted overage of 0.35 (0.7 copies), which might indicate the region is deleted, but it's only 9 SNPs. The likelihood scores for 0/0 and 0/1 genotypes are 0.7 and 0.3 respectively, so it's not a confident call.
Looking at the locus in UCSC genome browser
The region is almost entirely in a LINE repeat, which is likely causing the PE
signal. Likewise (very common for LINEs), the region is very unmappable. So when SV2 estimated the coverage of the region, it excluded all the gaps in the unmappable track in the image. I didn't apply a RepeatMasker mask, since that would exclude a lot of the genome, so the coverage estimates for this locus included some of the LINE that has mappable tracks. Since there are a lot of LINEs in the genome, more reads are probably aligning in that locus than you would expect which is elevating the coverage.
So is the region deleted? What's the MAPQ of the reads that are aligning there? Your IGV image indicates there is a likely deletion, but if the reads mapping there have less than 10 or 5 MAPQ, it's probably a false positive. If the MAPQ is high, then it's likely real.
In my experience LUMPY will call SVs in many LINEs and we remove those since most intersect the unmappable track by over 66% the length of the SV. As for CNVator, does it predict the region to be deleted with high confidence? I'm not familiar with how it performs in LINEs and other repeats.
Filters:
PASS: variant passes standard likelihood filters. This is determined by the median likelihood for all genotyped samples in your cohort
FAIL: variant fails standard likelihood filters.
NOALT: None of the samples have an ALT allele (determined by SV2)
GENOTYPEFAIL: The SV in question either entirely overlaps excluded regions and/or the coverage estimates are over 10 diploid copies. We excluded SVs with high coverage since during training we had poor performance.
Hello,
I have a ~7 kb high confidence deletion (called by pindel, CNVnator, and LUMPY) in a sample that is not being genotyped correctly by SV2. It is given a filter of 'NOALT', and a reference genotype of 0/0. See attached IGV screenshot.
The line from the SV2 vcf: 2 84888836 . .
NA NOALT END=84896512;SVTYPE=DEL;SVLEN=7677;DENOVO_FILTER=NA;REF_GTL=5;AF=0.0; GT:CN:PE:SR:SC:NS:HA:NH:SQ:GL 0/0:2.32:0.07:0.00:0.35:9:nan:0:5:0.69,0.31,0.00My script:
sv2 \ -i $BAMDIR/$BAM \ -v $WORKDIR/${SAMPLE}Out/variants.vcf.gz \ -snv $BAMDIR/${SAMPLE}_haplotypecaller_split_norm_filter.vcf.gz \ -p $WORKDIR/${SAMPLE}.ped \ -O $WORKDIR/SV2 \ -o ${SAMPLE}_80merged \ -merge \ -no-anno
Why would SV2 be unable to genotype this one?
Also, can you explain the filters (PASS, FAIL, NOALT, GENOTYPEFAIL)?
Thank you! Madeline