FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0

Question about homozygous SNP? #6

Closed hxlei closed 7 years ago

hxlei commented 7 years ago

I am a little confused by the term "homozygous SNP". The SNPsplit paper says that "If the genotype is not known, SNP positions may be called from the data itself, or from genome re-sequencing performed in parallel". For example, I have BS-seq data for a monkey as well as genome re-sequencing data from the same animal. The genotype is not known. Can SNPsplit genome preparation still be used? It uses only high-confidence homozygous positions, which confuses me: I would guess that in this case it is heterozygous SNPs, rather than homozygous ones, that can be used to assign reads. For the reciprocal mouse crosses reported by Xie and colleagues (GSE33722), I understand why homozygous SNPs are necessary. Thanks for any suggestions.

FelixKrueger commented 7 years ago

Hi hxlei,

I know it can be quite confusing… In terms of the Mouse Genomes Project, we are only using high-confidence homozygous SNPs of a genome relative to the Black6 reference, so that if we are looking at a hybrid strain such as Black6 // Castaneus we can be certain that a SNP really came from either the Black6 reference or the Castaneus alternative strain. The hybrid strain itself is of course heterozygous at these positions.

In cases that do not involve clean parental genotypes it is much more complicated, and to be honest this is probably when SNPsplit is not the tool of choice, but I’ll try to outline the problem anyway. From re-sequencing a genome you could still get a list of SNPs, and use these SNPs for N-masking a genome. Technically there is no reason why SNPsplit should not work in this scenario. However, if you look at the example attached you will see that you will have a hard time calling allele-specific effects without knowing the exact haplotypes.
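To make the N-masking step mentioned above concrete, here is a minimal Python sketch. It is not SNPsplit's own code; the function name `n_mask` and the toy sequence are made up for illustration, and positions are assumed to be 1-based as in a typical SNP list.

```python
def n_mask(sequence: str, snp_positions: list[int]) -> str:
    """Return a copy of `sequence` with every listed SNP position set to 'N'.

    Positions are assumed to be 1-based (hypothetical convention for this sketch).
    """
    bases = list(sequence)
    for pos in snp_positions:
        if not 1 <= pos <= len(bases):
            raise ValueError(f"SNP position {pos} is outside the sequence")
        bases[pos - 1] = "N"  # convert the 1-based position to a 0-based index
    return "".join(bases)


# Toy reference with SNPs called at positions 3 and 8.
reference = "ACGTACGTAC"
masked = n_mask(reference, [3, 8])
print(masked)  # ACNTACGNAC
```

Aligning against the masked sequence means reads carrying either allele match equally well at those positions, so the alignment itself is not biased towards the reference allele.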

In the example, all reads overlapping SNP 1 seem to contain ‘A’ at the position, so the reads would be sorted into Genome 1. Reads overlapping SNP 2 all have a ‘T’ at the SNP position and would thus be sorted into Genome 2. Since you don’t know if the T at this position really came from Genome 1 or Genome 2, you could potentially get a more or less random mix of Genome 1/Genome 2 assignments, which makes it impossible to call allele-specific effects over larger distances. It might still be possible to detect allelic imbalance at individual SNP positions, but in the absence of long-range haplotype phasing this might be the limitation you would have to live with. This is why we mainly advocate the use of SNPsplit with clean parental genotypes, such as with inbred mouse crosses. I hope this helps a little?
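The sorting problem described above can be reproduced with a toy Python sketch. This is only an illustration under stated assumptions, not how SNPsplit is implemented: the SNP table, the positions, and the `assign_read` helper are all hypothetical, and "genome1"/"genome2" simply mean "matches the reference allele" and "matches the alternative allele".

```python
from collections import Counter

# Hypothetical unphased SNP table: position -> (reference allele, alternative allele).
# Without phasing, ref/alt says nothing about which parental chromosome carries the allele.
SNPS = {100: ("A", "G"), 250: ("C", "T")}


def assign_read(position: int, base: str) -> str:
    """Sort a read by the base it carries at a known SNP position."""
    ref, alt = SNPS[position]
    if base == ref:
        return "genome1"
    if base == alt:
        return "genome2"
    return "unassignable"


# Toy reads that all derive from the SAME parental chromosome,
# which happens to carry A at SNP 1 (pos 100) and T at SNP 2 (pos 250).
reads = [(100, "A"), (100, "A"), (250, "T"), (250, "T")]

counts = Counter(assign_read(pos, base) for pos, base in reads)
print(dict(counts))  # {'genome1': 2, 'genome2': 2}
```

Although every read comes from one haplotype, half of them land in each "genome", which is why per-SNP allelic imbalance may still be measurable while region-wide allele-specific calls are not.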

hxlei commented 7 years ago

Thank you very much! Maybe I could use some software to get long-range haplotype phasing information and then make use of SNPsplit?

FelixKrueger commented 7 years ago

Yes, that might help. I'm just coming back from a 10X Genomics seminar, and it is looking very promising for phasing information!

hxlei commented 7 years ago

Thanks!