bmansfeld / QTLseqr

QTLseqr is an R package for QTL mapping using NGS Bulk Segregant Analysis

generalise `simulateAlleleFreq()` function to any ploidy #6

Open tavareshugo opened 6 years ago

tavareshugo commented 6 years ago

In #5 I asked this about the bulkSize option in runQTLseqAnalysis():

is it the case that this parameter should be the number of genome copies pooled, which for a diploid would be 2x the number of individuals pooled?

The answer is no: bulkSize should be the number of diploid individuals. I can see this is right, given the way the null expectation is being simulated.

I guess there are two levels to the simulation, because there are two levels of sampling: individuals are sampled from the population into the bulk, and then reads are sampled from the pooled DNA during sequencing.

I hadn't noticed, but indeed the first level of the simulation assumes individuals are diploid, because it samples diploid individual genotypes (c(0, 0.5, 1)) with probabilities corresponding to the expected segregation ratios in an F2 (c(1, 2, 1)/4).

But what if one is working with higher ploidy? Then the above simulation would not work.

However, I think the way it is implemented at the moment is equivalent to sampling from a binomial distribution, with the probability of success (picking an alternative allele) being 0.5 and the number of trials being the number of alleles sampled (2 x the number of individuals).

To illustrate with code:

library(ggplot2)  # for the density plots used to compare the simulations

# Define parameters of simulation
nind <- 100  # number of sampled individuals
ploidy <- 2  # assuming diploid
replications <- 1e6  # use many replications, partly to show that the second method is faster

# Current implementation - uses number of sampled individuals
sim1 <- replicate(replications, 
                  mean(sample(c(0, 0.5, 1), size = nind, 
                              prob = c(1, 2, 1)/4, replace = TRUE)))

# Other way - sample alternative alleles with probability 0.5 from n trials given by 
## the number of alleles, which is ploidy * nind
sim2 <- rbinom(replications, ploidy*nind, 0.5)/(ploidy*nind)

# Check they are equivalent
ggplot(data.frame(sim1, sim2)) +
  geom_density(aes(sim1)) +
  geom_density(aes(sim2), colour = "pink", linetype = "dashed")
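
As an extra sanity check (assuming the objects defined above), the two simulations should agree in their first two moments, and the rbinom() approach avoids the per-replicate loop entirely:

# Under the binomial model, both simulations should have mean 0.5 and
# standard deviation sqrt(0.5 * 0.5 / (ploidy * nind))
c(mean(sim1), mean(sim2))                            # both ~0.5
c(sd(sim1), sd(sim2), sqrt(0.25 / (ploidy * nind)))  # all ~0.035

# Timing a smaller run of each shows the speed difference
system.time(replicate(1e4, mean(sample(c(0, 0.5, 1), size = nind,
                                       prob = c(1, 2, 1)/4, replace = TRUE))))
system.time(rbinom(1e4, ploidy * nind, 0.5) / (ploidy * nind))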

I guess the advantage is that this approach is general with respect to ploidy (besides the bonus of being faster); a possible generalisation is sketched below.
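
For example, a ploidy-general version might look something like this (just a sketch: the function name and signature are mine, mirroring simulateAlleleFreq() but not taken from the package):

# Sketch of a ploidy-general allele frequency simulation; the binomial model
# assumes chromosome copies segregate independently
simAlleleFreqGeneral <- function(n, ploidy = 2, pop = "F2") {
  if (pop == "F2") {
    # each of the ploidy * n chromosome copies in the bulk carries the
    # alternative allele with probability 0.5
    rbinom(1, size = ploidy * n, prob = 0.5) / (ploidy * n)
  } else {
    # RILs are fully homozygous, so individuals behave like haploids
    rbinom(1, size = n, prob = 0.5) / n
  }
}

simAlleleFreqGeneral(n = 100, ploidy = 4)  # e.g. a tetraploid bulk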

I think the RIL implementation is already general as it is, because in that case we assume the individuals are fully homozygous, which makes them equivalent to "haploid" organisms. In any case, I think that implementation can also be made faster by sampling from a binomial:

# Current implementation
ril1 <- replicate(replications, 
                  mean(sample(
                    x = c(0, 1),
                    size = nind,
                    prob = c(1, 1),
                    replace = TRUE
                  )))

# Other way - null expectation is also allele frequency of 0.5, except now the number of trials
## is the number of individuals (because we can look at them as equivalent to haploid organisms)
ril2 <- rbinom(replications, nind, 0.5) / nind

# Check they are equivalent
ggplot(data.frame(ril1, ril2)) +
  geom_density(aes(ril1)) +
  geom_density(aes(ril2), colour = "pink", linetype = "dashed")
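
Note, by the way, that the ploidy enters the null purely through the number of trials, so the RIL null is wider than the diploid F2 null (continuing with the objects above):

# n trials for RILs vs ploidy * n for the F2, hence a wider null distribution
c(sd(ril2), sqrt(0.25 / nind))             # both ~0.05
c(sd(sim2), sqrt(0.25 / (ploidy * nind)))  # both ~0.035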

@bmansfeld please do check all of this, as I might be making a wrong assumption somewhere (I should also probably go and read the Takagi paper in more detail! :smile:)

bmansfeld commented 6 years ago

Hi Hugo,

Thanks for the detailed posts. It's been fun to collaborate and think on this with you. Regardless of the ploidy level, the rbinom solution is certainly faster, and I used it in the SNP index simulations; I guess it eluded me for some reason in the allele frequency simulation. The way I went about it was derived from the original, slower code in the Takagi et al. scripts, which randomly sampled from a uniform distribution and then asked whether each draw was higher or lower than 0.5 to decide the allele. I will probably incorporate the rbinom solution alongside the changes suggested in #5, just for the sake of speed.

In regard to polyploidy: though I am not a polyploid expert, the way you have it set up makes sense to me as is. However, I am aware that there are cases in autopolyploids that can behave in different ways, such as multivalent pairing etc. I will be in touch with Dr. Pat Edger in our department, who is our resident polyploidy expert, in the coming weeks to see if there is a way to integrate a more accurate representation of the allele frequencies. Maybe this is overkill and the method you suggested is perfect for the kinds of studies QTLseqr is for; I am just not confident enough about the polyploidy mess... Let me know your thoughts.

Ben
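
P.S. The uniform-draw idea in those original scripts was roughly this (a sketch from memory, not the actual Takagi et al. code):

# Decide each allele by comparing a uniform draw to 0.5; this is a
# Bernoulli(0.5) per chromosome copy, just slower than one rbinom() call
takagiStyleFreq <- function(nind, ploidy = 2) {
  alleles <- runif(ploidy * nind) < 0.5  # TRUE = alternative allele
  mean(alleles)
}
takagiStyleFreq(100)  # one simulated bulk allele frequency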

tavareshugo commented 6 years ago

Hi Ben,

I'm no polyploidy expert either, so it seems sensible to discuss with someone more familiar with polyploid genetics! I suppose the way I put it made two assumptions: that each chromosome copy carries the alternative allele with probability 0.5, and that all the copies segregate independently of each other.

If those assumptions don't hold, then the model would be wrong, I guess. As you say, a more "realistic" model might be hard, because it probably depends on the homology between chromosome copies, which will likely vary between chromosomes, individuals, varieties, species, etc...

I suppose if the function were made general, this assumption could be explicitly mentioned in the documentation, and it would be up to the user to decide. Also, if the default is ploidy = 2, then a substantial number of users won't even have to worry about this. :smile:
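
The documentation note could be quite short, something like this (hypothetical Roxygen text, just to illustrate):

#' @param ploidy number of chromosome copies per individual in the bulk.
#'   The null simulation assumes copies segregate independently (a binomial
#'   model), which may not hold for e.g. autopolyploids with multivalent
#'   pairing. Defaults to 2, so diploid analyses are unaffected.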

cfljam commented 6 years ago

I'm glad you are thinking about this. We have been trying QTLseqr on some pooled data from hybrid backcross autotetraploids segregating for a major gene, where the minor allele frequency of interest is 0.25. The results look sensible (similar to Popoolation CMH, but sharper!), but I'm interested in how the default expectations might not fit our system.
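
For what it's worth, if the binomial model discussed above even applies to our material (an assumption on my part), I imagine our null would just swap the expected frequency in for 0.5:

# Hypothetical null for a tetraploid bulk where the expected frequency of
# the allele of interest is 0.25 rather than 0.5
nind <- 100
null_freq <- rbinom(1e6, size = 4 * nind, prob = 0.25) / (4 * nind)
quantile(null_freq, c(0.025, 0.975))  # e.g. a 95% null interval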

WallyL commented 1 year ago

Has anyone ever followed up on this? I would be very interested in learning what code modifications might be employed to facilitate better default modeling when running QTLseqr with a tetraploid species.