FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0

Add --high_confidence option for dual hybrid genomes #9

Closed FelixKrueger closed 7 years ago

FelixKrueger commented 7 years ago

We have come across certain positions in the genome where different strains appear to have the same SNP (indicated by the GT/genotype field), but one of the strains failed the FI/FILTER criterion (1 is PASS, 0 is FAIL). Here is an example:

GT:GQ:DP:MQ0F:GP:PL:AN:MQ:DV:DP4:SP:SGB:PV4:FI
1/1:22:6:0.166667:152,22,0:137,18,0:2:36:6:0,0,6,0:0:-0.616816:.:1 (129)
1/1:15:4:0:79,15,0:67,12,0:2:24:4:0,0,4,0:0:-0.556411:.:0 (Cast)

For single hybrid genomes we would include this position in the 129 genome (1/1 homozygous SNP, first line), but would ignore the position for the Cast genome (also 1/1 homozygous SNP, but failed the high confidence FI filter, second line). This seems like a reasonable approach.
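The single hybrid decision described above can be sketched in a few lines of Python. This is only an illustration of the rule, not SNPsplit's actual code: the function names `parse_sample` and `is_high_confidence_hom_snp` are hypothetical, and treating only `1/1` as a homozygous SNP is a simplifying assumption.

```python
# FORMAT string and sample fields copied from the example in this issue.
FORMAT = "GT:GQ:DP:MQ0F:GP:PL:AN:MQ:DV:DP4:SP:SGB:PV4:FI"
SAMPLE_129 = "1/1:22:6:0.166667:152,22,0:137,18,0:2:36:6:0,0,6,0:0:-0.616816:.:1"
SAMPLE_CAST = "1/1:15:4:0:79,15,0:67,12,0:2:24:4:0,0,4,0:0:-0.556411:.:0"


def parse_sample(format_field, sample_field):
    """Pair each FORMAT key with the matching colon-separated sample value."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    if len(keys) != len(values):
        raise ValueError("FORMAT and sample field have different lengths")
    return dict(zip(keys, values))


def is_high_confidence_hom_snp(sample):
    """True if the strain is homozygous for the SNP (1/1) and passed FI."""
    return sample["GT"] == "1/1" and sample["FI"] == "1"


strain_129 = parse_sample(FORMAT, SAMPLE_129)
strain_cast = parse_sample(FORMAT, SAMPLE_CAST)

print(is_high_confidence_hom_snp(strain_129))   # position is used for 129
print(is_high_confidence_hom_snp(strain_cast))  # position is ignored for Cast
```

Both strains carry the same genotype here, so the only thing separating them is the FI flag.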

For dual hybrid genomes such positions might be a problem, though: when the 129 and Cast SNP lists are compared with each other, it looks as if there is a SNP between 129 and Cast, even though there was evidence that the genotype was the same (1/1) in both 129 and Cast. The Cast call simply did not pass the threshold to count as a high confidence SNP.
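To make the problem concrete, here is a small Python sketch of a naive comparison of two per-strain SNP lists. It is an assumption about how such a comparison could go wrong, not a copy of the SNPsplit genome preparation; the position `("chr1", 1000)`, the alleles `A`/`G` and the function `naive_dual_hybrid` are all hypothetical, since the issue does not show the CHROM/POS/REF/ALT columns.

```python
# Hypothetical position and alleles (not given in the issue).
POS = ("chr1", 1000)
REF, ALT = "A", "G"

# Per-strain lists that keep only high confidence homozygous SNPs,
# as in the single hybrid case:
snps_129 = {POS: ALT}  # GT 1/1, FI=1 -> kept
snps_cast = {}         # GT 1/1, FI=0 -> dropped, so Cast looks like reference


def naive_dual_hybrid(snps_a, snps_b, reference):
    """Report every position where the two strain lists imply different bases.

    A strain with no entry at a position is assumed to carry the reference
    base, which is exactly the assumption that breaks down here.
    """
    differences = {}
    for pos in set(snps_a) | set(snps_b):
        base_a = snps_a.get(pos, reference[pos])
        base_b = snps_b.get(pos, reference[pos])
        if base_a != base_b:
            differences[pos] = (base_a, base_b)
    return differences


spurious = naive_dual_hybrid(snps_129, snps_cast, {POS: REF})
print(spurious)  # a 129/Cast "SNP" is reported although both strains were 1/1
```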

As a solution to this can we change the SNPsplit genome preparation to store the FI value as well as the GT genotype and only use the position for a dual-hybrid SNP list if the position was measured with high confidence (i.e. FI=1) in both strains? Thanks to @nservant for helpful discussions in this regard.

FelixKrueger commented 7 years ago

I have now tried to add functionality for the --dual_hybrid mode to identify positions where both genomes had homozygous SNPs compared to the reference but where one strain did not pass the high confidence filters. Instead of making this a new option, this is now the default behaviour, since I believe this is the right thing to do. Addressed in 210af817b2ca2c3681226896381a119630a7a6f9 and 1ab9048f92fb475eea68e59804085deb7be4d382.

FelixKrueger commented 7 years ago

In addition to high confidence homozygous SNP positions we also see some cases of low confidence no-SNP positions, such as this one:

GT:GQ:DP:MQ0F:GP:PL:AN:MQ:DV:DP4:SP:SGB:PV4:FI
1/1:21:12:0:152,21,0:128,12,0:2:55:9:3,0,7,2:0:-0.662043:.:1
0/0:.:5:0:.,.,.:.,.,.:2:47:4:1,0,4,0:0:-0.556411:.:0

In line with only including high-confidence positions for the allele-specific analysis I have now added an additional check so that both FI fields need to have passed the filter (i.e. FI=1) irrespective of the genotype (which may e.g. be 0/0, 0/1 or ./.). This addition requires some additional memory compared to the original version but will make the genome preparation more robust.
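The stricter rule can be illustrated with a short Python sketch that looks only at the FI flag of each strain and ignores the genotype. The helper names `fi_flag` and `passes_in_both` are hypothetical, and the sketch assumes FI is the last FORMAT subfield, as in the examples in this thread; it is not taken from the commits referenced here.

```python
def fi_flag(sample_field):
    """Return the FI value, assumed to be the last colon-separated subfield."""
    return sample_field.rsplit(":", 1)[-1]


def passes_in_both(sample_a, sample_b):
    """A position is usable only if both strains have FI=1, whatever the GT."""
    return fi_flag(sample_a) == "1" and fi_flag(sample_b) == "1"


# First example from this thread: both strains 1/1, second strain FI=0.
example_1 = (
    "1/1:22:6:0.166667:152,22,0:137,18,0:2:36:6:0,0,6,0:0:-0.616816:.:1",
    "1/1:15:4:0:79,15,0:67,12,0:2:24:4:0,0,4,0:0:-0.556411:.:0",
)

# Second example: 1/1 with FI=1 versus a low confidence 0/0 with FI=0.
example_2 = (
    "1/1:21:12:0:152,21,0:128,12,0:2:55:9:3,0,7,2:0:-0.662043:.:1",
    "0/0:.:5:0:.,.,.:.,.,.:2:47:4:1,0,4,0:0:-0.556411:.:0",
)

# Hypothetical variant of example 2 with the second FI flipped to 1,
# purely to show a position that would be kept.
hypothetical_pass = (example_2[0], example_2[1][:-1] + "1")

print(passes_in_both(*example_1))          # excluded
print(passes_in_both(*example_2))          # excluded
print(passes_in_both(*hypothetical_pass))  # kept
```

Because the check applies to every genotype, the FI value has to be kept for positions without a SNP call as well, which is consistent with the extra memory mentioned above.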

Addressed in c9688d92e7c543de262ce8eaef96b87a6b7585eb and 481a4605332d3580f231ad0cf8e6dc6f937b343d.