tavareshugo opened this issue 6 years ago
Hi Hugo,
Thanks for the detailed posts. It's been fun to collaborate and think on this with you.
Regardless of the ploidy levels, the `rbinom` solution is certainly faster, and I used it in the SNP-index simulations. I guess it eluded me for some reason in the allele frequency simulation. The way I went was kind of derived from the original, slower code in the Takagi et al. scripts, which randomly sampled from a uniform distribution and then asked if each draw was higher or lower than 0.5 to decide the allele.
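Roughly the difference between these two approaches (a toy sketch of my own, not the actual Takagi et al. code):

```r
# Toy sketch (assumed names, not the original script):
n_alleles <- 40  # e.g. 2 alleles x 20 diploid individuals in a bulk

# Uniform-draw approach: one runif() call per allele, calling the allele
# "alternative" when the draw falls below 0.5, then counting.
freq_uniform <- sum(runif(n_alleles) < 0.5) / n_alleles

# rbinom approach: a single binomial draw replaces the per-allele sampling.
freq_binom <- rbinom(1, size = n_alleles, prob = 0.5) / n_alleles
```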
I will probably incorporate the `rbinom` solution alongside the changes suggested in #5, just for the sake of speed.
In regard to polyploidy, and though I am not a polyploid expert, the way you have it set up makes sense to me as is. However, I am aware that there are cases in autopolyploids that can behave in different ways, such as multivalent pairing, etc.
In the coming weeks I will be in touch with Dr. Pat Edger in our department, who is our resident polyploidy expert, and see if there is a way to integrate a more accurate representation of the allele frequencies.
Maybe this is overkill and the method you suggested is perfect for the kinds of studies QTLseqr is for; I am just not confident enough about the polyploidy mess...
Let me know your thoughts.
Ben
Hi Ben,
I'm no polyploidy expert either, so it seems sensible to discuss with someone more familiar with polyploid genetics! I suppose the way I put it made two assumptions:
If those assumptions don't hold, then the model would be wrong, I guess. As you say, a more "realistic" model might be hard, because it probably depends on homology between chromosome copies, which will likely vary between chromosomes, individuals, varieties, species, etc...
I suppose if the function were made general, this could be explicitly mentioned in the documentation and it would be up to the user to decide. Also, if the default is `ploidy = 2`, then a substantial number of users don't even have to worry about this. :smile:
I'm glad you are thinking about this. We have been trying QTLseqr on some pooled data of hybrid backcross autotetraploids segregating for a major gene, where the minor allele frequency of interest is 0.25. Results look sensible (similar to PoPoolation CMH, but sharper!) but I'm interested in how the default expectations might not fit our system.
Has anyone ever followed up on this? I would be very interested in learning what code modifications might be employed to facilitate better default modeling when running QTLseqr with a tetraploid species.
In #5 I asked a question about the `bulkSize` option in `runQTLseqAnalysis()`. The answer is no, `bulkSize` should be the number of diploid individuals. I see this is right, because of the way the null expectation is being simulated.

I guess there are two levels to the simulation, because there are two levels of sampling (sketched in code below):

1. sampling of individuals from the population when the bulks are made (the `simulateAlleleFreq()` function)
2. sampling of alleles when the bulks are sequenced (the `simulateSNPindex()` function)
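In code terms, I picture the two levels as something like this (a sketch with made-up names and numbers, not the package code):

```r
# Sketch of the two sampling levels (assumed names and numbers):
n_ind <- 20   # individuals pooled in the bulk
depth <- 50   # sequencing depth at a SNP

# Level 1: bulk allele frequency, from sampling individuals' alleles.
bulk_freq <- rbinom(1, size = 2 * n_ind, prob = 0.5) / (2 * n_ind)

# Level 2: observed SNP-index, from sampling reads at the given depth,
# each read carrying the alternative allele with probability bulk_freq.
snp_index <- rbinom(1, size = depth, prob = bulk_freq) / depth
```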
I hadn't noticed, but indeed the first level of the simulation assumes individuals are diploid, because it samples diploid individual genotypes (`c(0, 0.5, 1)`, with probabilities relating to the expected segregation ratios in an F2, `c(1, 2, 1)/4`).

But what if one is working with higher ploidy? Then the above simulation would not work. However, I think the way it is implemented at the moment is equivalent to sampling from a binomial, with the probability of the event (picking an alternative allele) being 0.5 and the number of trials being equal to the number of alleles sampled (2 x the number of individuals).
To illustrate with code:
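Something along these lines (a minimal sketch; `n_ind` and `n_sim` are placeholder names, and the numbers are arbitrary):

```r
n_ind <- 20   # diploid individuals per bulk
n_sim <- 1e4  # number of simulated bulks

# Current approach: sample diploid F2 genotypes (0, 0.5, 1 in 1:2:1
# proportions) for each individual and average them to get the bulk
# allele frequency.
freq_genotypes <- replicate(
  n_sim,
  mean(sample(c(0, 0.5, 1), size = n_ind, replace = TRUE, prob = c(1, 2, 1) / 4))
)

# Binomial formulation: 2 * n_ind allele draws, each one alternative with
# probability 0.5.
freq_binomial <- rbinom(n_sim, size = 2 * n_ind, prob = 0.5) / (2 * n_ind)

# The two should have matching distributions (compare mean and variance).
c(mean(freq_genotypes), mean(freq_binomial))
c(var(freq_genotypes), var(freq_binomial))
```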
I guess the advantage is that this is general, regardless of the ploidy (besides the bonus of being faster).
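For example, for an autotetraploid the number of trials would just scale with the ploidy (again a sketch; `ploidy` is my placeholder, and it carries the same simplifying assumptions discussed above):

```r
# Same idea for higher ploidy (sketch): the number of allele draws per
# bulk becomes ploidy x number of individuals, assuming each allele copy
# segregates independently with probability 0.5.
ploidy <- 4
n_ind  <- 20
n_sim  <- 1e4
freq_poly <- rbinom(n_sim, size = ploidy * n_ind, prob = 0.5) / (ploidy * n_ind)
```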
I think the RIL implementation is already general as it is, because in that case we assume the individuals are fully homozygous, in which case they are equivalent to "haploid" organisms. In any case, I think the implementation can also be made faster by sampling from a binomial:
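Something like this, perhaps (sketch, assumed names):

```r
# RIL case (sketch): fully homozygous individuals, so each one contributes
# a single effective allele draw (genotype 0 or 1, each with probability 0.5).
n_ind <- 20   # individuals per bulk
n_sim <- 1e4  # number of simulated bulks
freq_ril <- rbinom(n_sim, size = n_ind, prob = 0.5) / n_ind
```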
@bmansfeld please do check all of this, as I might be making some wrong assumption somewhere (I should also probably go and read the Takagi paper in more detail! :smile:)