TheoMichelot / hmmTMB

Fit hidden Markov models using Template Model Builder (TMB): flexible state-dependent distributions, transition probability structures, random effects, and smoothing splines.
53 stars 7 forks source link

Improved error message for zeros in `gamma` model #25

Open roaldarbol opened 7 months ago

roaldarbol commented 7 months ago

First of all, sorry about all the issues raised - just hoping to help as much as possible, as I appreciate all the work put into building this brilliant package, so it's all in a spirit of appreciation! :-)

When trying to fit data that includes zeros to a gamma model, the error message is somewhat uninformative. Maybe it could be possible to add a test of assumptions of the data given the model to give an error message that reflects which of the underlying assumptions are violated. So e.g. when running a gamma distribution, the check could be a simple any(data[["y"]] == 0). But could maybe be implemented as a general feature to check assumptions and report errors first.

Here's a reprex which produces the error only when 0 is in the data set.

library(hmmTMB, quietly = TRUE)
#> This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
library(tibble)
v1 <- rgamma(n = 500, shape = 1, scale = 0.01)
v2 <- rgamma(n = 500, shape = 3, scale = 2)

df_hmm <- tibble(speed_cm_s = c(v1, v2))
df_hmm_0 <- tibble(speed_cm_s = c(v1, v2, 0))

# Without 0
hid <- MarkovChain$new(data = df_hmm, n_states = 2)
par0_gamma <- list(speed_cm_s = list(shape = c(1, 3), scale = c(0.01, 2)))
obs <- Observation$new(data = df_hmm,
                       dists = list(speed_cm_s = "gamma"),
                       n_states = 2,
                       par = par0_gamma)
#> Warning: Unknown or uninitialised column: `time`.
#> Warning: Unknown or uninitialised column: `ID`.
mod1 <- HMM$new(obs = obs, hid = hid)
mod1$fit(silent = TRUE)

# With 0
hid <- MarkovChain$new(data = df_hmm_0, n_states = 2)
par0_gamma <- list(speed_cm_s = list(shape = c(1, 3), scale = c(0.01, 2)))
obs <- Observation$new(data = df_hmm_0,
                       dists = list(speed_cm_s = "gamma"),
                       n_states = 2,
                       par = par0_gamma)
#> Warning: Unknown or uninitialised column: `time`.
#> Unknown or uninitialised column: `ID`.
mod1 <- HMM$new(obs = obs, hid = hid)
mod1$fit(silent = TRUE)
#> Error in self$setup(silent = silent): Log-likelihood is NaN or infinite at starting parameters. Check that the data are within the domain of definition of the observation distributions, or try other starting parameters.

Created on 2024-04-28 with reprex v2.1.0

TheoMichelot commented 7 months ago

Thanks for the feedback. The reason the checks you mention aren't implemented is that it is less straightforward to check that the observations are in the right domain for distributions over discrete sets (e.g., Poisson, binomial). That being said, this should still be feasible (possibly ignoring the case of discrete observations at first). I think the steps could be:

  1. Add support_ as a field to the Dist class, which takes a string of the form "(-Inf, Inf)", or "(0, Inf)", or "[0, Inf)" (distinction between open and closed intervals is important)

    • add private attribute support_
    • add Dist$field() accessor
    • add support argument to Dist$initialize() and initialise in constructor (possibly with a default value)
  2. Add the support to existing distributions in R/dist_def.R

  3. Add a method Dist$check_support(x), which unpacks the character string from step 1, extracts the relevant inequalities, and checks whether those are satisfied by the input x (vector of observations)

  4. Add check in Observation$initialize()

    • loop over observation variables
    • for each, check whether observations are within support using Dist$check_support(x),
    • if not, print an error message of the form "Observations for the variable Z are not within the support of the 'gamma' distribution (0, Inf)"

Note that this won't be a priority because the error message is already reasonably informative ("Check that the data are within the domain of definition of the observation distributions").

roaldarbol commented 7 months ago

Wow, that's a super thorough fix - I like it! And no worries, there's no rush, just thought it was worth to file the issue. :-)

I must admit I couldn't find a source that specified that gamma distributions have to be >0, but it might have been hidden in formulaic language somewhere. So even just having a tiny write-up about what the "domain of definition of the observation distribution" is for the different distributions would be of great help.