improving `bulkSize` option in `runQTLseqAnalysis()`

tavareshugo commented 6 years ago

(before moving on to the issue, firstly thanks a lot for developing this very convenient package!)

In runQTLseqAnalysis() the option bulkSize only uses an integer of length 1. Therefore, if I understood the code rightly, the allele frequency simulation samples frequencies assuming both bulks have the same number of individuals.

However, in practice the two bulks might not have the same number of individuals. Perhaps it should be considered that the user could provide either one or two values for bulkSize. Then, instead of only generating one vector of "null" allele frequencies (temp_freq), two vectors could be generated, one for each pool?

Something like this might work?:

# Check it's a positive integer
if(!is.integer(bulkSize) | bulkSize <=0) stop("bulkSize must be a positive integer.")

# If length one assume both pools have same size
if(length(bulkSize) == 1){
   message("Assuming both pools have the same size: ", bulkSize)
   bulkSize <- c(bulkSize, bulkSize)
}

# Sample null allele frequencies for each pool separately
tmp_freq1 <-
   replicate(n = replications * 10, simulateAlleleFreq(n = bulkSize[1], pop = popStruc))
tmp_freq2 <-
   replicate(n = replications * 10, simulateAlleleFreq(n = bulkSize[2], pop = popStruc))

# Apply to depths
CI <- sapply(
            X = depth,
            FUN = function(x)
            {
                quantile(
                    x = simulateSNPindex(
                        depth = x,
                        altFreq1 = sample(
                            x = tmp_freq1,
                            size = replications,
                            replace = TRUE
                        ),
                        altFreq2 = sample(
                            x = tmp_freq2,
                            size = replications,
                            replace = TRUE
                        ),
                        replicates = replications,
                        filter = filter
                    ),
                    probs = intervals,
                    names = TRUE
                )
            }
)

Also, is it the case that this parameter should be the number of genome copies pooled, which for a diploid it would be 2x number of individuals pooled? If so, the documentation could be a bit more explicit about this, maybe?

I'm glad to fork and implement the suggestion above, but just wanted to check if it makes sense.

bmansfeld commented 6 years ago

Hey Hugo, Thanks for the great feedback, I hope QTLseqr can be useful in your research. This is also a good suggestion for improvement. I think it could improve the accuracy for QTLseq confidence interval calling. In fact the Yang2013 trial data has different bulk sizes and I recall thinking about this, but I was a little lazy in implementing since I felt that most users have equal bulk sizes... 😉 I like your solution and it seems like how I would have done it. If you would like to implement I would greatly appreciate it. I have unfortunately never yet had a collaboration on github, so I haven't actually ever accepted pulls requests etc. But it's about time I learned how to. I can then work on the new documentation for the function and I would need to update the runQTLseqAnalysis function documentation as well as the vignette.

In regard to your second question, no, it is just the number of diploid individuals. I can make that clearer in the the documentation. This is the way I interpreted Tagaki et al's simulation. Is there a reason that you can think of that it should be on based on haploid copies?

Thanks again for your input and possible collaboration. Ben

tavareshugo commented 6 years ago

Hi Ben,

I have to confess I myself don't have extensive experience in collaborating with code. But, my advice here would be to maybe make a release of your current version (instructions here), because this is the published one and is working.

You might also want to consider making sure the users install the latest release version with devtools (see here).

Then, maybe create a development branch (e.g. called devel) where one can be testing these other implementations without messing up the master branch.

Regarding the second question, I'll open a new issue for more discussion :smile:

bmansfeld commented 6 years ago

Different bulkSizes have been added to QTLseqr v7.0.0

bmansfeld / QTLseqr

improving `bulkSize` option in `runQTLseqAnalysis()` #5