Question About Statistical Assumptions

cwarden45 commented 3 years ago

I am paraphrasing an e-mail that I received (and providing this link as a response).

If the sender is OK with it, then I will upload more specific / complete information about the question.

Basically, I thought this might be the relevant information:

Using 0.3 as both methylated and unmethylated threshold
Reviewer concern that the data is not binomial.

cwarden45 commented 3 years ago

I think this could be relevant for a number of individuals.

I think these might be the most relevant points:

I would consider COHCAP a method to generate hypotheses about candidates, which can then be validated.
I think it is already understood for this specific question, but I think it is common to need or benefit from changing parameter settings for different projects.
In most cases, the methylation thresholds (methyl.cutoff and unmethyl.cutoff) are being used to help filter results, kind of like the delta beta parameter (delta.beta.cutoff). I think the main exception is the Average-by-Site workflow, where a discrete status per site is being assigned. Otherwise, the statistical test is usually for continuous beta or percent methylation values.

In general, I don't want people to only use COHCAP for analysis.

For example, for RRBS analysis, I would tend to use methylKit and COHCAP for testing. Sometimes I thought methylKit was better, sometimes I emphasized COHCAP more (with familiarity with all of the parameters that can be changed). However, I apologize in advance that I can't provide assistance with using those templates for specific projects (and I think the best solution is to have your own template, rather than using that exact template as a starting point).

While I don't think I currently have papers that I can reference for the DNA Methylation parts, you can see some data for the general principle for RNA-Seq here:

http://cdwscience.blogspot.com/2019/11/requiring-at-least-some-methods-testing.html

So, if you think of COHCAP like the ANOVA test for log2-transformed values for RNA-Seq, there are situations where using the methods with a negative binomial model (such as edgeR and DESeq2) can have advantages. The data linked above also includes limma-voom, which makes different statistical assumptions. However, the standard statistical test is often not horrible, and I think having an independently calculated expression value was helpful in comparing the different methods for every project.

In fact, I think the less specific statistical test can sometimes help. In that context, I think the less specialized ANOVA might be helpful for applications like miRNA-Seq (which I am not currently providing data for in that link, and I also don't think I currently have publications to cite). You can also see the sample for E-MTAB-7033, where I might argue that recovery of the causal gene with a relatively more symmetric set of up- and down-regulated genes might have been an advantage for the ANOVA test (at least with the STAR alignment). I think the ANOVA results also tended to have smaller gene lists, if you wanted to focus on a smaller number of individual genes to characterize. I am not sure if things like a larger sample size and/or less commonly tested multivariate models might also be considerations, but the point is that I would not completely take a relatively standard ANOVA test off the table as an option (and ANOVA / t-test / linear regression are often what is used for the p-value calculation in COHCAP).

cwarden45 commented 3 years ago

Also, I believe something about the binomial distribution was mentioned (along with mentioning the methylation thresholds in a separate sentence). COHCAP does not use a binomial distribution for a statistical test, but methylation thresholds are applied to the distribution of beta values (or percent methylation values). Some may call that distribution "bimodal".

Ideally, if the peaks were clear enough, then I think you would have 3 combinations of status for the 2 alleles in a human sample (homozygous unmethylated, hemizygous methylated, and homozygous methylated).

The default thresholds of 0.3 and 0.7 are meant to capture the 2 clearest homozygous peaks (perhaps in something like a cell line experiment). However, if you have heterogeneous data, that may not be possible. So, I think a threshold of 0.3 for both values could be used (basically merging the hemizygous and homozygous methylated peaks, if different cells might have different methylation values).
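To make the effect of the two settings concrete, here is a small sketch in Python (not COHCAP's R code). The function name `methylation_status` and the "intermediate" label are my own for illustration; only the cutoff values 0.3 and 0.7 come from the discussion above.

```python
def methylation_status(beta, methyl_cutoff=0.7, unmethyl_cutoff=0.3):
    """Assign a discrete status to one beta (or fractional methylation) value.

    Hypothetical helper for illustration only; the labels and the
    order of the checks are assumptions, not COHCAP's implementation.
    """
    if beta >= methyl_cutoff:
        return "methylated"
    if beta <= unmethyl_cutoff:
        return "unmethylated"
    return "intermediate"


# Defaults (0.3 / 0.7): a middle peak near 0.5 is left unassigned.
print(methylation_status(0.5))            # intermediate
print(methylation_status(0.85))           # methylated

# 0.3 for both cutoffs: the middle peak is merged with the methylated peak.
print(methylation_status(0.5, 0.3, 0.3))  # methylated
print(methylation_status(0.2, 0.3, 0.3))  # unmethylated
```

With both cutoffs at 0.3, no value can fall in the "intermediate" class, which is the merging described above.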

If you keep the delta beta threshold of 0.2 (or at the very least 0.1), then that should also help avoid getting results that are only slightly above 0.3 in one group and slightly below 0.3 in the other group.
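As a rough illustration of why the delta beta threshold still matters when both cutoffs are 0.3, here is a Python sketch of the filtering logic as I have described it in this thread. The function `passes_filters` and its argument names are hypothetical, and this is not the actual COHCAP code (which is in R and also applies p-value / FDR cutoffs).

```python
def passes_filters(trt_beta, ref_beta,
                   methyl_cutoff=0.7, unmethyl_cutoff=0.3,
                   delta_beta_cutoff=0.2):
    """Return True if a site passes the threshold and delta beta filters.

    trt_beta / ref_beta are the average beta values per group.
    Illustrative sketch only; the p-value filters are left out.
    """
    delta = trt_beta - ref_beta
    increased = (trt_beta >= methyl_cutoff
                 and ref_beta <= unmethyl_cutoff
                 and delta >= delta_beta_cutoff)
    decreased = (ref_beta >= methyl_cutoff
                 and trt_beta <= unmethyl_cutoff
                 and -delta >= delta_beta_cutoff)
    return increased or decreased


# Borderline site: 0.32 vs 0.28 straddles 0.3, but the difference is tiny.
print(passes_filters(0.32, 0.28, 0.3, 0.3, delta_beta_cutoff=0.0))  # True
print(passes_filters(0.32, 0.28, 0.3, 0.3, delta_beta_cutoff=0.2))  # False

# Larger difference: 0.55 vs 0.25 passes with the 0.2 delta beta cutoff.
print(passes_filters(0.55, 0.25, 0.3, 0.3, delta_beta_cutoff=0.2))  # True
```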

That said, I think the threshold may have more to do with the distribution of values being "bimodal" (or trimodal), rather than binomial assignments per read or bead (if you could determine that). For example, the per-read assignment is why the binomial or beta-binomial distribution is used for some BS-Seq analysis (including methylKit, with overdispersion).
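To show what a per-read binomial view adds compared to a single percent methylation value, here is a minimal Python sketch using the textbook binomial standard error. The function `binomial_se` and the read counts are made up for illustration; this is not how methylKit or COHCAP calculate anything.

```python
import math


def binomial_se(methylated_reads, coverage):
    """Standard error of the methylated fraction under a simple binomial model.

    Hypothetical helper: assumes each read is an independent
    methylated / unmethylated call, with no overdispersion.
    """
    p = methylated_reads / coverage
    return math.sqrt(p * (1 - p) / coverage)


# Two sites with the same 30% methylation but very different coverage.
low_cov = binomial_se(3, 10)
high_cov = binomial_se(300, 1000)

print(round(low_cov, 3))   # 0.145
print(round(high_cov, 4))  # 0.0145
```

A test on the continuous value treats both sites as 0.3, while a count-based model can take the coverage into account. The bimodal (or trimodal) shape is a separate question about how those 0.3-type values are distributed across sites.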

cwarden45 / COHCAP

Question About Statistical Assumptions #3