easystats / performance

:muscle: Models' quality and performance metrics (R2, ICC, LOO, AIC, BF, ...)
https://easystats.github.io/performance/
GNU General Public License v3.0

Warnings for binomial stats using frequency format #165

Open bwiernik opened 4 years ago

bwiernik commented 4 years ago

With the binomial() family, the response can be specified in three ways (see ?family):

  1. Case/Factor: Rows indicating cases, response a factor indicator success/failure
  2. Frequency: Rows indicating trial arms, response numeric between 0 and 1 indicating proportion of success, weights indicating number of trials in the arm.
  3. Matrix: Rows indicating trial arms, response a two-column matrix, first column indicating number of successes, second column indicating number of failures.

These three specifications yield equivalent coefficients, standard errors, and log-likelihoods, but other overall model fit statistics (e.g., deviance, AIC, df.residual, etc.) aren't accurate because the sample size used for them is wrong. The correct sample sizes for the three cases would be:

  1. nobs(model)
  2. sum(weights(model))
  3. sum(insight::get_response(model))

Because of this, the results for r2() (and insight::n_obs()) aren't accurate when the second or third specification is used. Some r2() functions error with the third specification because they don't expect a multi-column response, while others silently return inaccurate results with the second or third specification (r2_somers fails for both: the frequency form because the Hmisc function rejects non-binary values, and the matrix form because it doesn't expect a multi-column response).

Would it be possible, when family(model)$family %in% c("binomial", "quasibinomial"), to check whether the response is given in frequency or matrix form, and then either return an error/warning or instead compute the statistics using appropriate methods?

Example showing errors with `r2_nagelkerke()`:

```
contr <- rbinom(n = 100, size = 1, prob = .2)
treat <- rbinom(n = 100, size = 1, prob = .8)

dat_factor <- data.frame(
  group = factor(c(rep("c", length(contr)), rep("t", length(treat)))),
  outcome = c(contr, treat)
)

dat_proportion <- data.frame(
  group = factor(c("c", "t")),
  size = c(length(contr), length(treat)),
  outcome_prop = c(sum(contr) / length(contr), sum(treat) / length(treat))
)

dat_matrix <- data.frame(
  group = factor(c("c", "t")),
  size = c(length(contr), length(treat)),
  success = c(sum(contr), sum(treat)),
  failure = c(length(contr) - sum(contr), length(treat) - sum(treat))
)

m_factor <- glm(outcome ~ group, family = binomial(link = "logit"), data = dat_factor)
m_proportion <- glm(outcome_prop ~ group, weights = size, family = binomial(link = "logit"), data = dat_proportion)
m_matrix <- glm(cbind(success, failure) ~ group, family = binomial(link = "logit"), data = dat_matrix)

performance::r2_nagelkerke(m_factor)
performance::r2_nagelkerke(m_proportion)
performance::r2_nagelkerke(m_matrix)
```
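As a rough sketch (a hypothetical helper, not an existing performance function), the matrix case at least seems straightforward to detect and warn about:

```
# Hypothetical guard: warn when a (quasi)binomial model was fit with a
# successes/failures matrix response, since fit statistics based on
# nobs() (the number of rows) would then be misleading.
check_binomial_response <- function(model) {
  if (family(model)$family %in% c("binomial", "quasibinomial")) {
    y <- model.response(model.frame(model))
    if (is.matrix(y)) {
      warning(
        "Response is a successes/failures matrix; fit statistics based ",
        "on the number of rows (nobs) may be misleading.",
        call. = FALSE
      )
    }
  }
  invisible(model)
}

check_binomial_response(m_matrix)  # warns; m_factor would pass silently
```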
strengejacke commented 4 years ago

Is there a safe way to detect which kind of response / model we have?

strengejacke commented 4 years ago

For instance, here we have a matrix response where nobs() returns the correct value:

data(cbpp, package="lme4")
m_test <- glm(cbind(incidence, size-incidence) ~ period, family=binomial, data=cbpp)
nobs(m_test)
#> [1] 56
nrow(cbpp)
#> [1] 56

Created on 2020-11-24 by the reprex package (v0.3.0)

bwiernik commented 4 years ago

That's not the right number though. That's the number of herd-period groups, but not the number of cattle/cases. The R2 values computed based on n = 56 here aren't going to be correct.

library(dplyr)
library(tidyr)
library(performance)

dat_agg <- lme4::cbpp
dat_ind <- lme4::cbpp %>%
    rowwise() %>%
    mutate(sick = list(c(rep(0, size - incidence), rep(1, incidence)))) %>%
    ungroup() %>%
    unnest(sick) %>%
    select(-size, -incidence)

m_agg <- glm(cbind(incidence, size-incidence) ~ period, family=binomial, data= dat_agg)
m_ind <- glm(sick ~ period, family=binomial, data= dat_ind)

# same coefficients
parameters::parameters(m_agg)
parameters::parameters(m_ind)

# different R2
r2_tjur(m_agg)
r2_tjur(m_ind)

r2_somers(m_agg)
r2_somers(m_ind)

r2_nagelkerke(m_agg)
r2_nagelkerke(m_ind)

The functions currently only work when the response variable is either a factor or a 0-1 binary variable, unweighted (indicating that the rows are case-level, not group-level). If the response is non-binary (e.g., numeric values greater than 1 or non-integer values) or if it is a matrix, then the data are group-level and should not be used with the current functions.

strengejacke commented 4 years ago

So your answer to this question:

Is there a safe way to detect which kind of response / model we have?

would be:

If the response is non-binary (e.g., numeric values greater than 1 or non-integer values) or if it is a matrix, then the data are group-level and should not be used with the current functions.

And for matrix, nobs would be sum(insight::get_response(model)), while for non-binary/non-integer, I could assume sum(weights(model))?

bwiernik commented 4 years ago

Yes that is all correct.

strengejacke commented 3 years ago

Ok, this is the solution for the matrix column. However, I think the "minus" is only required when there's a subtraction in the cbind()?

library(dplyr)
library(tidyr)
library(insight)

dat_agg <- lme4::cbpp
dat_ind <- lme4::cbpp %>%
  rowwise() %>%
  mutate(sick = list(c(rep(0, size - incidence), rep(1, incidence)))) %>%
  ungroup() %>%
  unnest(sick) %>%
  select(-size, -incidence)

m_agg <- glm(cbind(incidence, size-incidence) ~ period, family=binomial, data= dat_agg)
m_ind <- glm(sick ~ period, family=binomial, data= dat_ind)

nobs(m_ind)
#> [1] 842
sum(insight::get_response(m_agg)) - sum(lme4::cbpp$incidence)
#> [1] 842

Created on 2020-12-04 by the reprex package (v0.3.0)

mattansb commented 3 years ago

Yes:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

dat_ind <- lme4::cbpp %>%
  rowwise() %>%
  mutate(sick = list(c(rep(0, size - incidence), rep(1, incidence)))) %>%
  ungroup() %>%
  select(-size, -incidence) %>% 
  unnest(sick)

dat_agg <- lme4::cbpp %>% 
  mutate(no_incidence = size - incidence)

m_agg1 <- glm(cbind(incidence, size - incidence) ~ period,
              family = binomial,
              data = dat_agg)

m_agg2 <- glm(cbind(incidence, no_incidence) ~ period,
              family = binomial,
              data = dat_agg)

m_ind <- glm(sick ~ period, family = binomial, data = dat_ind)

nobs(m_ind)
#> [1] 842

head(insight::get_response(m_agg1))
#>   incidence size
#> 1         2   14
#> 2         3   12
#> 3         4    9
#> 4         0    5
#> 5         3   22
#> 6         1   18
sum(insight::get_response(m_agg1))
#> [1] 941

head(insight::get_response(m_agg2))
#>   incidence no_incidence
#> 1         2           12
#> 2         3            9
#> 3         4            5
#> 4         0            5
#> 5         3           19
#> 6         1           17
sum(insight::get_response(m_agg2))
#> [1] 842

Created on 2020-12-15 by the reprex package (v0.3.0)

As for option 2:

Frequency: Rows indicating trial arms, response numeric between 0 and 1 indicating proportion of success, weights indicating number of trials in the arm.

@bwiernik do you have any suggestions for how we might differentiate between such a case and a "regular" weighted binomial regression? Perhaps when the response is a vector, all 0 < response < 1, and weights * response gives integers?
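Something along these lines, perhaps (just a sketch of that heuristic, not an insight function; note that arms with 0% or 100% success would slip through the strict 0 < y < 1 check):

```
# Frequency form: a vector response of proportions whose product with the
# prior weights recovers integer success counts.
is_frequency_form <- function(model) {
  y <- model.response(model.frame(model))
  w <- weights(model, type = "prior")
  is.numeric(y) && is.vector(y) &&
    all(y > 0 & y < 1) &&
    all(abs(w * y - round(w * y)) < sqrt(.Machine$double.eps))
}
```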

bwiernik commented 3 years ago

I think the get_response() behavior is wrong. The response should be a data frame with the first column being successes and the second column being failures. It seems incorrect that the second column in get_response() is returning the values of size rather than size-incidence, considering that size-incidence is what was actually specified in the formula.

Compare to the result of model.response():

model.response(model.frame(m_agg))
bwiernik commented 3 years ago

@mattansb Per the documentation for glm(), positive integer weights should be interpreted as frequencies. I think this would only apply when response is a vector and I suppose you could additionally impose the 0 < response < 1 restriction. Note that it must be the "prior" weights that are extracted, not the "working" weights.

Non-NULL weights can be used to indicate that different observations have different dispersions (with the values in weights being inversely proportional to the dispersions); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations. For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes: they would rarely be used for a Poisson GLM.
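A minimal illustration of the difference, using hypothetical two-arm data:

```
dat <- data.frame(group  = factor(c("c", "t")),
                  trials = c(50, 50),
                  prop   = c(0.2, 0.8))
m <- glm(prop ~ group, weights = trials, family = binomial(), data = dat)

weights(m, type = "prior")       # 50 50 -- the number of trials per arm
weights(m, type = "working")     # ~8 ~8 -- IWLS weights (trials * p * (1 - p))
sum(weights(m, type = "prior"))  # 100   -- the effective number of cases
```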

mattansb commented 3 years ago

Thanks @bwiernik, this was very informative.

strengejacke commented 3 years ago

I think the get_response() behavior is wrong.

No, I think actually not. See the note in documentation:

Unlike model.frame(), which may contain transformed variables (e.g. if poly() or scale() was used inside the formula to specify the model), get_data() aims at returning the "original", untransformed data (if possible). Consequently, column names are changed accordingly, i.e. "log(x)" will become "x" etc. for all data columns with transformed values.

bwiernik commented 3 years ago

Got it. Didn't realize that. In that case, extracting the response using model.response(model.frame(x)) would probably be a more straightforward approach here versus insight::get_response().
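For instance, reusing m_agg1 from the reprex above:

```
# model.response() returns the (successes, failures) matrix exactly as
# written via cbind() in the formula, so its sum is the total number of cases:
head(model.response(model.frame(m_agg1)))
sum(model.response(model.frame(m_agg1)))
#> [1] 842
```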

strengejacke commented 3 years ago

Yes, I'll rewrite the n_obs() method in the next few days, so that r2() is computed correctly.

strengejacke commented 3 years ago

Matrix columns are now handled correctly. Do you have an example for a non-integer response, where the sum(weights(model)) case applies?

Note that these models still have different deviance (https://github.com/easystats/insight/issues/256#issuecomment-737032089), so R2 still does not match, but maybe I'll find a solution for that as well...

IndrajeetPatil commented 3 years ago

Just logging the current differences in r2() values to track progress on this issue:

(I will dig into the correlation issue for r2_somers on its repo.)

library(plyr)
library(tidyr)
library(dplyr)
library(performance)

dat_agg <- lme4::cbpp
dat_ind <- lme4::cbpp %>%
  rowwise() %>%
  mutate(sick = list(c(rep(0, size - incidence), rep(1, incidence)))) %>%
  ungroup() %>%
  unnest(sick) %>%
  select(-size, -incidence) 

m_agg <- glm(cbind(incidence, size - incidence) ~ period, family = binomial, data = dat_agg)
m_ind <- glm(sick ~ period, family = binomial, data = dat_ind)

# same coefficients
parameters::parameters(m_agg)
#> Parameter   | Log-Odds |   SE |         95% CI |     z |      p
#> ---------------------------------------------------------------
#> (Intercept) |    -1.27 | 0.14 | [-1.56, -0.99] | -8.76 | < .001
#> period [2]  |    -1.17 | 0.29 | [-1.77, -0.62] | -4.02 | < .001
#> period [3]  |    -1.30 | 0.31 | [-1.95, -0.72] | -4.16 | < .001
#> period [4]  |    -1.78 | 0.41 | [-2.68, -1.04] | -4.31 | < .001
parameters::parameters(m_ind)
#> Parameter   | Log-Odds |   SE |         95% CI |     z |      p
#> ---------------------------------------------------------------
#> (Intercept) |    -1.27 | 0.14 | [-1.56, -0.99] | -8.76 | < .001
#> period [2]  |    -1.17 | 0.29 | [-1.77, -0.62] | -4.02 | < .001
#> period [3]  |    -1.30 | 0.31 | [-1.95, -0.72] | -4.16 | < .001
#> period [4]  |    -1.78 | 0.41 | [-2.68, -1.04] | -4.32 | < .001

# different R2
r2_tjur(m_agg)
#>   Tjur's R2 
#> 0.001039836
r2_tjur(m_ind)
#>  Tjur's R2 
#> 0.05058399

r2_somers(m_agg)
#> Error in correlation::cor_test(input, "y", "pred", method = "somers"): The names you entered for x and y are not available in the dataset. Make sure there are no typos!
r2_somers(m_ind)
#> Somers' Dxy 
#>   0.3556833

r2_nagelkerke(m_agg)
#> Nagelkerke's R2 
#>       0.2810604
r2_nagelkerke(m_ind)
#> Nagelkerke's R2 
#>      0.09161516

Created on 2021-02-24 by the reprex package (v1.0.0)

strengejacke commented 3 years ago

bump

Can anyone provide an example model for this case:

Frequency: Rows indicating trial arms, response numeric between 0 and 1 indicating proportion of success, weights indicating number of trials in the arm.

mattansb commented 3 years ago

Here are all the options:

library(dplyr, include.only = c("%>%" , "group_by", "summarise", "n"))
#> Warning: package 'dplyr' was built under R version 4.0.5

mtcars2 <- mtcars %>% 
  group_by(cyl) %>% 
  summarise(
    p = mean(am),
    w = n(),
    success = sum(am == 1),
    fail = sum(am == 0)
  )

m0 <- glm(am ~ factor(cyl), mtcars,
          family = binomial())

m1 <- glm(p ~ factor(cyl), mtcars2,
          weights = w,
          family = binomial())

m2 <- glm(cbind(success, fail) ~ factor(cyl), mtcars2,
          family = binomial())

m3 <- glm(cbind(success, w - success) ~ factor(cyl), mtcars2,
          family = binomial())

insight::n_obs(m0)
#> [1] 32
insight::n_obs(m1)
#> [1] 3
insight::n_obs(m2)
#> [1] 32
insight::n_obs(m3)
#> [1] 32

Created on 2021-04-20 by the reprex package (v1.0.0)

strengejacke commented 3 years ago
library(dplyr, include.only = c("%>%" , "group_by", "summarise", "n"))

mtcars2 <- mtcars %>% 
  group_by(cyl) %>% 
  summarise(
    p = mean(am),
    w = n() * (1 / mean(am)),
    trial = n(),
    success = sum(am == 1),
    fail = sum(am == 0)
  )

m0 <- glm(am ~ factor(cyl), mtcars,
          family = binomial())

m1 <- glm(p ~ factor(cyl), mtcars2,
          weights = w,
          family = binomial())

m2 <- glm(cbind(success, fail) ~ factor(cyl), mtcars2,
          family = binomial())

m3 <- glm(cbind(success, trial - success) ~ factor(cyl), mtcars2,
          family = binomial())

insight::n_obs(m0)
#> [1] 32
insight::n_obs(m1)
#> [1] 32
insight::n_obs(m2)
#> [1] 32
insight::n_obs(m3)
#> [1] 32

Created on 2021-04-20 by the reprex package (v2.0.0)

strengejacke commented 3 years ago

n_obs now works correctly; however, the deviance differs, and thus r2() is still wrong for some models. @bwiernik you suggested:

to check whether the response is given in frequency or matrix form, then either return an error/warning or instead compute the statistics using appropriate methods?

What would be an "appropriate method"? Otherwise, if the model is not in Bernoulli form, I would issue a warning.

bwiernik commented 3 years ago

m1 <- glm(p ~ factor(cyl), mtcars2, weights = w, family = binomial())

I'm confused. The weights should be trial here--the total number of trials.

mattansb commented 3 years ago

Yes, @strengejacke why did you change the computation from my perfect example??

strengejacke commented 3 years ago

I'm confused. The weights should be trial here--the total number of trials.

But then it doesn't align with the nobs, which do not sum to 32.

bwiernik commented 3 years ago

For this model:

m1 <- glm(p ~ factor(cyl), mtcars2, weights = trial, family = binomial())

n_obs() should be calculated as sum(weights(m1)). I thought you had implemented that a while ago?

bwiernik commented 3 years ago

I don't know what w = num_trials * (1 / percent successes) is supposed to be?

strengejacke commented 3 years ago

sum(weights(m1))

Ok, I had response * weights...

strengejacke commented 3 years ago
library(dplyr, include.only = c("%>%" , "group_by", "summarise", "n"))

mtcars2 <- mtcars %>% 
  group_by(cyl) %>% 
  summarise(
    p = mean(am),
    trial = n(),
    success = sum(am == 1),
    fail = sum(am == 0)
  )

m0 <- glm(am ~ factor(cyl), mtcars,
          family = binomial())

m1 <- glm(p ~ factor(cyl), mtcars2,
          weights = trial,
          family = binomial())

m2 <- glm(cbind(success, fail) ~ factor(cyl), mtcars2,
          family = binomial())

m3 <- glm(cbind(success, trial - success) ~ factor(cyl), mtcars2,
          family = binomial())

insight::n_obs(m0)
#> [1] 32
insight::n_obs(m1)
#> [1] 32
insight::n_obs(m2)
#> [1] 32
insight::n_obs(m3)
#> [1] 32

Created on 2021-04-20 by the reprex package (v2.0.0)

bwiernik commented 3 years ago

m1 <- glm(p ~ factor(cyl), mtcars2, weights = w, family = binomial())

You mean:

m1 <- glm(p ~ factor(cyl), mtcars2,
          weights = trial,
          family = binomial())
bwiernik commented 3 years ago

what would be an "appropriate method"? Else, if not Bernoulli, I would yield a warning.

For now, just return a warning and maybe NULL values for r2 for m1, m2, m3. The deviance isn't really comparable to the Bernoulli trial format.

Here is a useful set of examples from Ben Bolker on aggregated binomial modeling: https://bbolker.github.io/stat4c03/notes/binomial_poisson.pdf

bwiernik commented 3 years ago

@strengejacke Okay, we can compute the analogous log likelihood statistic for aggregated binomial models as logLik(m) - sum(lchoose(n_total, n_success)), where n_total is the total number of cases represented by a stratum and n_success is the number of successes in that stratum.

This term is part of the likelihood for binomial models, but it's a constant, so it doesn't impact the model fit and R (annoyingly) omits it.

From there, we can compute the correct deviance, AIC, r2, etc.
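As a quick check of that adjustment (reusing m_agg and m_ind from the cbpp reprex above, where the strata are the herd-period rows):

```
ll_adj <- as.numeric(logLik(m_agg)) -
  sum(lchoose(lme4::cbpp$size, lme4::cbpp$incidence))

ll_adj         # case-level ("Bernoulli") log likelihood implied by m_agg
logLik(m_ind)  # should match, up to numerical precision
```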

strengejacke commented 3 years ago

What is lchoose?

bwiernik commented 3 years ago

log (N choose k) -- the log of the number of possible combinations.
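For example:

```
lchoose(10, 3)      # 4.787492
log(choose(10, 3))  # the same value: log(120)
```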

See the first answer here: https://stats.stackexchange.com/questions/144121/logistic-regression-bernoulli-vs-binomial-response-variables