USEPA / SSN2

SSN2: Spatial Modeling on Stream Networks in R
https://usepa.github.io/SSN2/
GNU General Public License v3.0

possible issue with degrees of freedom in anova function? #25

Open katiewampler opened 1 week ago

katiewampler commented 1 week ago

Not a stats expert, so perhaps I'm just not understanding. But in the anova function for both glm and lm, whenever I compare two models, no matter what the formulas are, it always reports the difference in degrees of freedom as 0, resulting in a p-value of 0.

It looks like the code is pulling the degrees of freedom from object$npar (which appears to be related to the number of autocorrelation structures used and doesn't change when the autocorrelation structure is held constant), whereas p seems to be the number of parameters in the formula, which to my understanding is where the change in degrees of freedom should come from.

katiewampler commented 1 week ago

Not sure if this is related or a separate issue, but regardless of the degrees of freedom, the chi-squared values I get from the SSN2 package are very different from those I get from the betareg package for beta family models; the two packages seem to return very different logLik values from the optim function, which results in very different p-values. If I test a model with one variable removed, the p-value from the log-likelihood test in the betareg package matches the p-value for the removed coefficient fairly closely, but the log-likelihood test in the SSN2 package gives a very different result.

Example below: lsn.ssn.zip

```r
library(SSN2)
library(sf)
library(betareg)
library(dplyr)

ssn <- ssn_import("~/1_Research/5_McKenzie_EEMs/data/parafac_ssn/lsn.ssn") # load SSN
df <- ssn_get_data(ssn)
df[, 6:11] <- sapply(df[, 6:11] %>% st_drop_geometry(), function(x) { x / 100 }) # convert to a fraction
ssn <- ssn_put_data(df, ssn) # put converted data back in the SSN object

# make beta regression models using the SSN2 package
mod1a <- ssn_glm(Comp_1 ~ DNBR * ARIDITY, ssn, family = "beta",
                 tailup_type = "none", taildown_type = "none",
                 euclid_type = "none", estmethod = "ml")
mod2a <- ssn_glm(Comp_1 ~ DNBR + ARIDITY, ssn, family = "beta",
                 tailup_type = "none", taildown_type = "none",
                 euclid_type = "none", estmethod = "ml")
summary(mod1a)

SSN2:::anova.ssn_glm(mod1a, mod2a)

# with betareg
mod1b <- betareg(Comp_1 ~ DNBR * ARIDITY, df)
mod2b <- betareg(Comp_1 ~ DNBR + ARIDITY, df)

summary(mod1b)
lmtest::lrtest(mod1b, mod2b) # anova likelihood ratio test

# coefficients and significance of coefficients are very similar
# anova results are very different
```

michaeldumelle commented 1 week ago

Thanks @katiewampler; I'll answer both of your questions below.

Question 1

There are two things going on here regarding the anova likelihood ratio test (LRT) for nested models. First, it is important to clarify whether you are talking about models fit using REML (estmethod = "reml") or ML (estmethod = "ml"). For REML, the difference in degrees of freedom is the difference in the number of autocorrelation parameters between the two models (the fixed effect structure of two REML models must be the same). For ML, the difference in degrees of freedom is the difference in the number of autocorrelation parameters plus the difference in the number of fixed effect parameters (the fixed effect structure of two ML models does not have to be the same). So the code works properly for REML, but you have identified a bug for ML. More technical details about AIC/AICc/BIC/anova can be found at this link (note that spmodel and SSN2 use the same fitting engine; we are working on a technical details document for SSN2, which will eventually be placed on the SSN2 website).
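To make the ML counting rule above concrete, here is a minimal sketch (a hypothetical helper, not SSN2's actual code) that assumes each fitted object carries npar (autocorrelation parameters) and p (fixed effect parameters):

```r
# Hypothetical sketch of the ML degrees-of-freedom rule described above.
# Assumes `full` and `reduced` each carry:
#   npar - number of autocorrelation (covariance) parameters
#   p    - number of fixed effect parameters
lrt_df_ml <- function(full, reduced) {
  (full$npar + full$p) - (reduced$npar + reduced$p)
}

# Example: same autocorrelation structure, one extra fixed effect (the
# interaction term), so the LRT should have 1 degree of freedom, not 0.
full <- list(npar = 2, p = 4)
reduced <- list(npar = 2, p = 3)
lrt_df_ml(full, reduced)
```

Counting only npar, as the current code does, gives 0 here, which matches the reported behavior.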

Question 2

The ssn_glm() models use a slightly different likelihood than those from betareg. Also, as specified in your code, the ssn_glm() models allow for a nugget effect (see nugget_type), which betareg models do not. The nugget effect was estimated to be pretty small, and as such, parameter estimates and p-values from ssn_glm() and betareg are very similar (though the likelihoods are different). Generally we advise using REML, though we recognize that a benefit of ML is using likelihood-based statistics to compare models with varying fixed effect structures. AIC/AICc/BIC/anova LRT comparisons for REML models require the same fixed effect structure, while deviance() and loocv() fit statistics can be used no matter the estimation method or fixed effect structure. As an aside, we are working on identifying some stability updates for the ML estimation likelihood when all autocorrelation variance parameters approach zero.
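Since both packages plug their (different) log-likelihoods into the same LRT arithmetic, a shift in logLik feeds directly through to the chi-squared statistic and p-value. A self-contained sketch with illustrative, made-up log-likelihood values (not results from the models above):

```r
# Generic likelihood ratio test arithmetic (illustrative numbers only).
ll_full <- -10.0    # log-likelihood of the full model (hypothetical)
ll_reduced <- -12.5 # log-likelihood of the reduced model (hypothetical)
df_diff <- 1        # difference in degrees of freedom

chisq <- 2 * (ll_full - ll_reduced)
p_value <- pchisq(chisq, df = df_diff, lower.tail = FALSE)
chisq   # 5
p_value # about 0.025
```

Two likelihoods that differ by a constant offset would still give the same chisq, so the discrepancy you observe implies the likelihoods differ by more than a constant (as expected from the different model forms).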

If a goal is to study the impact of the spatial stream network structure on model fit, we recommend comparing models that are all fit using ssn_glm(), because they use the same likelihood. If proceeding with a single, final model is the goal, we recommend that the final model eventually be fit using REML.

What Next?

I will edit this message and close the issue when the bug for Question 1 is fixed.