GATB / DiscoSnp

DiscoSnp is designed for discovering all kinds of SNPs (not only isolated ones), as well as insertions and deletions, from raw set(s) of reads.
https://gatb.inria.fr/software/discosnp/
GNU Affero General Public License v3.0

Equivalent information between discosnp and STACKS2 #38

Closed. clorenzo1 closed this issue 1 year ago

clorenzo1 commented 1 year ago

Hi there,

I am using DiscoSnpRAD to explore the structure and demography of my target organism. For the demography, I have used the dadi software on an SFS and would like to convert the results of my runs into biologically meaningful values.

The output of STACKS2 gives you the total number of sites, the total number of variant sites, and which of those are polymorphic. Does the total number of variant sites from STACKS2 translate into the total number of variant bubbles from DiscoSnpRAD? If not, where can I find the equivalent information?

Cheers!

clemaitre commented 1 year ago

Hi,

The total number of bubbles does not represent the total number of variant sites, since a single bubble can contain several SNPs (when they are phased). The number of lines in the output VCF should be a better reflection of the total number of variant sites.

Best, Claire

clorenzo1 commented 1 year ago

Hi Claire,

Thanks for the prompt reply. Yes, that's clear to me.

Would it therefore make sense, in order to get all variable sites (as in STACKS2, which differs from polymorphic sites), to set my allowed number of SNPs per bubble to the length of my kmer, such that it can find all possible SNPs within the optimal kmer length?

Best, Coral

clemaitre commented 1 year ago

Hi Coral,

I am not sure I fully understand your question. The maximum number of SNPs allowed in a bubble is not related to the k-mer size; it can even be larger than the k-mer size. Two SNPs belong to the same bubble if they are less than k nucleotides apart and their two pairs of alleles are phased in the datasets. Many successive SNPs can satisfy this condition, forming one large bubble.

If this does not answer your question, could you clarify it?

Best, Claire
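
The distance part of the condition described above can be illustrated with a small Python sketch. It is not DiscoSnp code: the function name `chain_close_snps` and the input positions are hypothetical, distance is assumed to be the difference between SNP coordinates, and the phasing requirement is deliberately ignored. It only shows why a run of successive close SNPs can span more than k nucleotides in total.

```python
def chain_close_snps(positions, k):
    """Group SNP positions into runs where each SNP lies less than k
    nucleotides after the previous one (phasing is NOT checked here)."""
    runs = []
    for pos in sorted(positions):
        if runs and pos - runs[-1][-1] < k:
            # Close enough to the last SNP of the current run: extend it.
            runs[-1].append(pos)
        else:
            # Too far from the previous SNP (or first SNP): start a new run.
            runs.append([pos])
    return runs


# Hypothetical positions; k = 31 is an arbitrary example value.
print(chain_close_snps([10, 30, 55, 200], 31))
# -> [[10, 30, 55], [200]]
```

In this toy input the first run covers 45 nucleotides, more than k, even though every consecutive pair is less than k apart.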

clorenzo1 commented 1 year ago

Hi Claire,

To clarify, I need to find the maximum total number of SNPs, phased or unphased, in my dataset. Is there a way to do this?

Best wishes, Coral

clemaitre commented 1 year ago

Hi Coral,

I am not sure I understand what you mean by "maximum". The total number of SNPs found by DiscoSnp is the number of lines in the VCF (phased SNPs are written as several lines in the VCF, one SNP per line).

Best, Claire
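
As a rough illustration of counting lines in the VCF, the Python sketch below counts the data lines (skipping `#` header lines) and, separately, those whose REF and ALT are both a single base. The function name `count_vcf_records` and the example records are made up for illustration, and treating "single-base REF and ALT" as the definition of a SNP is an assumption based on the standard VCF column layout, not on anything specific to DiscoSnp output.

```python
import io


def count_vcf_records(handle):
    """Return (data_lines, single_base_records) for a VCF-like text stream."""
    total = 0
    snps = 0
    for line in handle:
        # Skip blank lines and header/meta lines, which start with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        total += 1
        fields = line.rstrip("\n").split("\t")
        ref, alt = fields[3], fields[4]  # standard VCF columns REF and ALT
        if len(ref) == 1 and len(alt) == 1 and alt != ".":
            snps += 1
    return total, snps


# Made-up records, for illustration only.
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "locus_1\t31\t.\tA\tG\t.\tPASS\t.",
    "locus_1\t45\t.\tC\tT\t.\tPASS\t.",
    "locus_2\t12\t.\tAT\tA\t.\tPASS\t.",
]) + "\n"

print(count_vcf_records(io.StringIO(EXAMPLE_VCF)))
# -> (3, 2)
```

For a real file, the same function can be called on `open(path)`, where `path` is the location of the VCF to inspect.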

clorenzo1 commented 1 year ago

Hi Claire,

Sorry for the confusion. Let me try to clarify:

I am aware that the total number of SNPs found by DiscoSnp is given by the number of lines in the VCF. My question is about how to set the parameters, before running DiscoSnp, so as to reach that number.

I can set the following parameter:

P = the number of polymorphisms per bubble,

to a value of my choice. Depending on this value it seems I can call more or fewer SNPs, is that correct? This therefore must be a parameter that controls the number of SNPs found by discoSNP, is that right? Is there a limit to this value when setting the parameters?

I hope this is clearer! :)

Best, Coral

clemaitre commented 1 year ago

Hi,

The -P parameter affects the number of SNPs found, but only marginally, because it concerns a very particular type of bubble that is much less abundant in de Bruijn graphs: sequences of N SNPs, each less than k nucleotides from the next, that are phased in all samples. That is, only 2 distinct combinations of alleles can be found for these N variants across all samples. When the number of samples is large, it is unlikely that many such bubbles exist for large values of N. If more than 2 combinations of alleles exist, these closely located SNPs are still found by DiscoSnp, but as distinct "regular" bubbles.

Best, Claire
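
The "only 2 distinct combinations of alleles" condition can be made concrete with a toy Python check. This is a simplified model, not how DiscoSnp detects bubbles in the de Bruijn graph: the function name `is_phased_in_all_samples` is hypothetical, and each input string is assumed to hold the alleles observed together at the N close SNPs on one sequence, pooled over all samples.

```python
def is_phased_in_all_samples(allele_combinations):
    """True if exactly 2 distinct allele combinations are observed
    for the N closely located SNPs, pooled over all samples."""
    return len(set(allele_combinations)) == 2


# Three close SNPs, two combinations only: fits the multi-SNP bubble case.
print(is_phased_in_all_samples(["ACG", "TGA", "ACG", "TGA"]))
# -> True

# A third combination appears ("AGA"): per the explanation above, the SNPs
# would then be reported as distinct "regular" bubbles instead.
print(is_phased_in_all_samples(["ACG", "TGA", "AGA"]))
# -> False
```

The more samples are pooled, the more chances there are for a third combination to appear, which is consistent with such bubbles being rare for large N.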

clorenzo1 commented 1 year ago

Hi Claire,

Ok, thank you for clarifying. My next question is whether nucleotide diversity or other such molecular diversity parameters can be calculated within discosnp?

Cheers, Coral