Closed epurdom closed 3 years ago
Yes, that is indeed wrong. Thankfully, we always provided n_samples
as a parameter and did not rely on that line of code. I'm not a big fan of trying to estimate the number of samples in this case: if I am not mistaken, what we are interested in is not the number of samples necessarily, but the number of samples involved in the model. There are many cases, both on the S. bicolor
timecourse data or the shoemaker data where we only consider two of the conditions and not the whole set of conditions (such as doing the timecourse DE analysis only on control versus post-flowering). I'm tempted to remove this estimation of the number of samples, and only rely on the parameter given by the user (in this case, us when we fit the model).
The other two options are 1) add the warning as my comments say we should; 2) have a better estimate of the number of samples by looking at which of the rows of the basis matrix have none-zeros elements.
Ok, I will fix it just on principle anyway.
Just FYI, this came up because I am working on changing the computations to be more flexible for additional parameters in the formula model. Basically, I'm rewriting the code to use a full constraint matrix implied by the null hypothesis (and a function to create this matrix automatically), so that things like the df and calculating the null fitted values don't implicitly depend on the formula being of our default form. For example, if you do
~Group:ns(Timepoint) + Replicate + 0
then the code fails because there are these other variables in the basis_matrix
of the object that don't involve the group comparison. Hence me trying to make sure I understand the original code.
Yes, indeed, these formulas are problematic with the current code. Does the code actually fail or does it give wrong results? I was worried when we talked about it now several months ago that the code would run but give wrong results.
It sometimes will hit an error, but not always. The error we hit was the calculation of df somewhere gives a fractional number (because #variables/#groups wasn't an integer), and that created an error. However, otherwise when we monkeyed with the formula so that happened to be an integer, it ran.
However, if you use all the data in creating the fit, then you would need to have the degrees of freedom reflect that, even if those samples weren't in the groups you were comparing in the contrast. So I think it should be all data. There might be special cases where the orthogonality of the design would mean that you would get the same fit even if you hadn't included those samples, in which case either might be valid; but generally I think you need all samples.
I'm also not clear why (RSS0-RSS1)/RSS1 is being multiplied by n
in the calculation of the F-statistic (line 125)
n * (ss0 - ss1)/(ss1)
I'm also confused about the LRT value, because normally it would be n log [RSS(NULL)/RSS(FULL)] is chisq distributed. It seems like you are taking the n(RSS0-RSS1)/RSS1 per group and summing it, and there doesn't seem to be a log, so I'm a little confused. Is there a simplification in the group case that I'm forgetting?
On line 122 in
DE_timecourse.R
we haveLooking at the code, I think this should be the number of samples (n), which I believe should be found as
nrow(basis)
. Later, there iswhich is the second degrees of freedom for a F statistic, so it should be n, the number of samples (like the variable name suggests).