NelleV / moanin

Timecourse transcriptomic analysis
https://nellev.github.io/moanin/
Other
5 stars 0 forks source link

Is the compute_pvalue for DE_timecourse giving the wrong value for n? #90

Closed epurdom closed 3 years ago

epurdom commented 3 years ago

On line 122 in DE_timecourse.R we have

    if(is.null(n_samples)){
        n_samples <- ncol(basis)
        # FIbasisME raise warning
    }

Looking at the code, I think this should be the number of samples (n), which I believe should be found as nrow(basis). Later, there is

df2 <- n_samples - degrees_of_freedom * n_groups

which is the second degrees of freedom for a F statistic, so it should be n, the number of samples (like the variable name suggests).

NelleV commented 3 years ago

Yes, that is indeed wrong. Thankfully, we always provided n_samples as a parameter and did not rely on that line of code. I'm not a big fan of trying to estimate the number of samples in this case: if I am not mistaken, what we are interested in is not the number of samples necessarily, but the number of samples involved in the model. There are many cases, both on the S. bicolor timecourse data or the shoemaker data where we only consider two of the conditions and not the whole set of conditions (such as doing the timecourse DE analysis only on control versus post-flowering). I'm tempted to remove this estimation of the number of samples, and only rely on the parameter given by the user (in this case, us when we fit the model).

The other two options are 1) add the warning as my comments say we should; 2) have a better estimate of the number of samples by looking at which of the rows of the basis matrix have none-zeros elements.

epurdom commented 3 years ago

Ok, I will fix it just on principle anyway.

Just FYI, this came up because I am working on changing the computations to be more flexible for additional parameters in the formula model. Basically, I'm rewriting the code to use a full constraint matrix implied by the null hypothesis (and a function to create this matrix automatically), so that things like the df and calculating the null fitted values don't implicitly depend on the formula being of our default form. For example, if you do

~Group:ns(Timepoint) + Replicate + 0 

then the code fails because there are these other variables in the basis_matrix of the object that don't involve the group comparison. Hence me trying to make sure I understand the original code.

NelleV commented 3 years ago

Yes, indeed, these formulas are problematic with the current code. Does the code actually fail or does it give wrong results? I was worried when we talked about it now several months ago that the code would run but give wrong results.

epurdom commented 3 years ago

It sometimes will hit an error, but not always. The error we hit was the calculation of df somewhere gives a fractional number (because #variables/#groups wasn't an integer), and that created an error. However, otherwise when we monkeyed with the formula so that happened to be an integer, it ran.

However, if you use all the data in creating the fit, then you would need to have the degrees of freedom reflect that, even if those samples weren't in the groups you were comparing in the contrast. So I think it should be all data. There might be special cases where the orthogonality of the design would mean that you would get the same fit even if you hadn't included those samples, in which case either might be valid; but generally I think you need all samples.

epurdom commented 3 years ago

I'm also not clear why (RSS0-RSS1)/RSS1 is being multiplied by n in the calculation of the F-statistic (line 125)

n * (ss0 - ss1)/(ss1)
epurdom commented 3 years ago

I'm also confused about the LRT value, because normally it would be n log [RSS(NULL)/RSS(FULL)] is chisq distributed. It seems like you are taking the n(RSS0-RSS1)/RSS1 per group and summing it, and there doesn't seem to be a log, so I'm a little confused. Is there a simplification in the group case that I'm forgetting?