davidearn / epigrowthfit

Estimating Epidemic Growth Rates
9 stars 1 forks source link

Suppressing the intercept #4

Open dushoff opened 10 months ago

dushoff commented 10 months ago

Suppressing the intercept (i.e., , formula_parameters = ~ 0+country) yields unexpected behaviour. In particular. it changes the model fit.

bbolker commented 10 months ago

A minimal reproducible example would be convenient, if you have one. This example seems to work OK:

library("epigrowthfit")
example(egf)  ## to simulate data/set up structures
e2 <-  egf(model = egf_model(curve = "exponential", family = "pois"),
             formula_ts = cbind(time, x) ~ country,
             formula_windows = cbind(start, end) ~ country,
             formula_parameters = ~ country,
             data_ts = data_ts,
             data_windows = data_windows,
             se = TRUE)

e3 <- update(e2,  formula_parameters = ~ 0 + country)
logLik(e3)
'log Lik.' -2593.562 (df=20)
> logLik(e2)
'log Lik.' -2593.542 (df=20)

The fact that these differ by 0.02 log-likelihood units (when they should be identical) doesn't worry me too much.

On the other hand, the predictions differ by 14% ...

 all.equal(predict(e2), predict(e3))
[1] "Component “estimate”: Mean relative difference: 0.135498"

Possibly a starting-parameter-values issue?

jaganmn commented 10 months ago

Possibly a starting-parameter-values issue?

Yeah ...

The algorithm employed to select init when not specified in the egf call could try harder to be reasonable in more cases.

For no-intercept models, it sets beta to zero. For intercept models, it sets beta to zero except for the intercept which it sets to mean(p). Here, length(p) is the number of windows and p[i] is a guess computed from the observations in window i. For $\log(r)$, the guess is the log of the slope of a line fit to log1p(cumsum(x)).

So the default is reasonable for intercept models with sum to zero contrasts, but otherwise typically unreasonable.

This all happens in epigrowthfit:::egf_tmb_make_parameters, with sources in R/egf_utils_tmb.R.

bbolker commented 10 months ago

I guess a reasonable no-intercept default would be to set all of the parameters in the vector to the same value (rather than setting the intercept to a sensible guess and all of the non-intercept starting values to zero)? (This may be what you had in mind already ...) (Note from ?terms.object that terms(x)$intercept is 0 or 1 depending on whether the model includes an intercept ...)

dushoff commented 10 months ago

There is an example (updated just now) at https://github.com/dushoff/notebook/blob/master/egf.R.

On Wed, Jan 3, 2024 at 12:23 PM Ben Bolker @.***> wrote:

Caution: External email.

A minimal reproducible example would be convenient, if you have one. This example seems to work OK:

library("epigrowthfit") example(egf) ## to simulate data/set up structures e2 <- egf(model = egf_model(curve = "exponential", family = "pois"), formula_ts = cbind(time, x) ~ country, formula_windows = cbind(start, end) ~ country, formula_parameters = ~ country, data_ts = data_ts, data_windows = data_windows, se = TRUE) e3 <- update(e2, formula_parameters = ~ 0 + country)

logLik(e3)'log Lik.' -2593.562 (df=20)> logLik(e2)'log Lik.' -2593.542 (df=20)

The fact that these differ by 0.02 log-likelihood units (when they should be identical) doesn't worry me too much.

One possible issue (for harder-to-estimate cases) might have to do with starting values ... ?

— Reply to this email directly, view it on GitHub https://github.com/davidearn/epigrowthfit/issues/4#issuecomment-1875713370, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAW5E6LB4NXULWWMQQQK3L3YMWHXJAVCNFSM6AAAAABBLW4QNWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZVG4YTGMZXGA . You are receiving this because you authored the thread.Message ID: @.***>

-- Jonathan Dushoff (https://tinyurl.com/jd-pronouns) McMaster University Department of Biology https://mac-theobio.github.io/ http://jd-mathbio.blogspot.com/

bbolker commented 10 months ago

I started to look at the logic but didn't get very far.

jaganmn commented 10 months ago

For a fixed effects formula like ~0 + country, yes. But for anything more complicated, one would have to implement logic determining which elements of beta to set (and to what) based on details in the terms object, model frame, and fixed effects model matrix.

I guess elements corresponding to terms with numeric variables could be set to 0, and elements corresponding to other terms could be set to (N-offset)/R if N is the naive estimate and R is the number of ones in the remaining columns of the model matrix ... ? Then in principle <offset> + sum(<row of model matrix> * <beta>) would give N ... ?

bbolker commented 10 months ago

I haven't yet, but it might also be worth trying to look at JD's example to see why it goes so badly wrong. (A helpful enhancement/addition/wishlist item would be the ability to set starting parameters manually [a brief look at the docs didn't suggest that there was a way to do this ...]). Alternatively, if I could figure out how to extract the underlying TMB object (with its fn and env components), that would be another way to dig in.

jaganmn commented 10 months ago

The user can set the starting value of the parameter vector(s) with egf(..., init = list(beta=, theta=)), etc. It is documented in help("egf"), but obviously a vignette would help here. The TMB object is egf(...)[["tmb_out"]] (see, e.g., help("egf-class")).

But, yes, I will take a closer look at the example.