easystats / parameters

:bar_chart: Computation and processing of models' parameters
https://easystats.github.io/parameters/
GNU General Public License v3.0

Appropriate Statistical Test Question #586

Closed delucalab closed 3 years ago

delucalab commented 3 years ago

Dear All,

I have collected data on the presence and number of lesions of 1) different types and 2) within different locations in a patient cohort. I looked at the following summary (http://htmlpreview.github.io/?https://github.com/strengejacke/mixed-models-snippets/blob/master/overview_modelling_packages.html) to come up with an approach and wanted to get some feedback. Ultimately, I want to perform the following:

1) Case-level approach to look for differences in the proportion of cases with lesions at the different locations (region 1, 2, and 3) and of the different lesion types (type 1, 2, and 3). To accomplish this, I thought I would perform a glmer (family=binomial) model since the data is binary (lesion vs not) at the nested levels of location and lesion type for each case. Would you agree with this method?

2) Lesion-level approach using the data I have on number of lesions (rather than just presence) to look for differences in the proportion of lesions that are found at particular locations (region 1, 2, and 3) and are of particular types (type 1, 2, and 3). To accomplish this, I thought I would perform glmmTMB(ziformula, family=beta_family/betabinomial) model on the proportion of total lesions identified for each case that fall into the different categories (ie locations and types) since the data is proportional, includes 0 and 1, and is nested for each case. Would this be the best method?

As always, thank you for your insights and help! Do not hesitate to let me know if I need to clarify my aims or the type of data any further.

bwiernik commented 3 years ago

It's best to think of both of these as "case-level"--the difference here is whether you are dichotomizing lesions into a binary variable (present or not) versus leaving it as a count variable (number of lesions).

I'll discuss model form in a minute. Let's first discuss your predictors. You have described 3 predictor variables—case, type, and location. You need to decide whether to model each of these as a fixed-effects predictor or as a random-effects predictor. To decide, ask yourself, are you interested in the specific values of the variables themselves (e.g., these cases, these locations), or do you want to treat these values as samples from a broader population for the variable and to generalize to that broader population (e.g., do you want to generalize to the population of potential cases, to the population of potential locations)? Another way to think about this is--do you want to take extreme values for one of the cases/locations/types at face value, or do you want to regularize them a bit and pull them toward the overall mean (this is often reasonable). If you want to generalize to a broader population or regularize values, model the variable as a random grouping factor. Otherwise, model it as a fixed factor.

For example, if you want to model case as a random factor but type and location as fixed factors, you could use the formula: lesion ~ location + type + (1 | case)

If you want to model all 3 as random factors, then you could use: lesion ~ (1 | location) + (1 | type) + (1 | case)

Both of the above formulations treat the three types of grouping factors as distinct (but correlated): cases may have predispositions toward more lesions generally, but not predispositions to specific types or locations of lesions.

If you want to consider predispositions toward specific types/locations across cases, you can add an interaction to your grouping structure: lesion ~ (1 | location) + (1 | type) + (1 | case) + (1 | case:location) + (1 | case:type)

Here, I've left in the direct effects of location and type to consider that these factors may have main effects in addition to their individual case-level effects. You could consider dropping those.
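These formulas presuppose a long-format data frame with one row per case, location, and type. The thread never shows the data, so the layout, the column names, and the simulated counts below are all assumptions; treat this as a rough sketch of the shape the formulas expect rather than the poster's actual data.

```r
# Hypothetical long-format layout: one row per case x location x type.
# All names and the simulated counts are illustrative, not the poster's data.
set.seed(1)
n_cases <- 5
dat <- expand.grid(
  case     = factor(seq_len(n_cases)),
  location = factor(paste("Region", 1:3)),
  type     = factor(paste("Type", 1:3))
)
# Simulated lesion counts per cell; many cells will be zero
dat$lesion <- rnbinom(nrow(dat), mu = 0.8, size = 1)

# The formulas from the comment, stored as plain formula objects
f_fixed  <- lesion ~ location + type + (1 | case)
f_random <- lesion ~ (1 | location) + (1 | type) + (1 | case)

head(dat)
all.vars(f_fixed)  # every variable the formula needs as a column of `dat`
```

With only three locations and three types, the variances of `(1 | location)` and `(1 | type)` are estimated from very few levels, which is worth keeping in mind when choosing between the fixed and random versions.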

Now, turning to your question about model family.

The most appropriate form I would argue is your (2)--to model the number of lesions, which may include zero. For this approach, you likely want to choose a family that (1) reflects a count variable, (2) is flexible about the mean and variance for the lesion counts, and (3) also flexibly models the absence of any lesions. For this, I would recommend a zero-inflated negative binomial model, as it has all of these features. glmmTMB can fit this family of models; the call here uses random effects for the main count portion of the model and fixed effects for the zero-inflation.

glmmTMB::glmmTMB(lesion ~ (1 | location) + (1 | type) + (1 | case), ziformula = ~ location + type, family = glmmTMB::nbinom2(), data = dat)  # dat is a placeholder name for your data frame

If you want to model the zero-inflation with random effects as well, use brms.

brms::brm(brms::bf(lesion ~ (1 | location) + (1 | type) + (1 | case), zi ~ location + type), family = brms::zero_inflated_negbinomial(), data = dat)  # dat is a placeholder name for your data frame

Note that you can use this model to answer your first question (what predicts presence of any lesions versus none), and it does so better than a binomial model, because the binomial model treats any number of lesions greater than zero as the same.
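To make the link to the presence/absence question concrete: under a zero-inflated negative binomial, the probability of at least one lesion is one minus the total probability of a zero, which combines the zero-inflation part and the count part. The helper below is a minimal sketch; the function name and the parameter values are hypothetical, and `mu`/`size` follow the parameterization of `stats::dnbinom`.

```r
# P(at least one lesion) under a zero-inflated negative binomial.
# zi   = probability of a structural (excess) zero
# mu   = mean of the negative binomial count part
# size = NB dispersion parameter (as in stats::dnbinom)
p_any_lesion <- function(mu, size, zi = 0) {
  p_zero <- zi + (1 - zi) * dnbinom(0, size = size, mu = mu)
  1 - p_zero
}

# Hypothetical parameter values, for illustration only
p_any_lesion(mu = 2, size = 1, zi = 0.3)
p_any_lesion(mu = 2, size = 1)  # no zero-inflation
```

In a fitted model, `mu` and `zi` would come from the model's predictions for a given location and type rather than being fixed by hand.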

delucalab commented 3 years ago

Thank you so much! So would using the suggested model address assessing the differences shown in the following tables? The first table is the proportion of cases with particular lesion type/location while the second is the proportion of lesions of a certain type/location. I think this is why I was thinking of trying to split analyses into case-level vs lesion-level but may just be confusing myself.

|  | Type 1 | Type 2 | Type 3 |
| -- | -- | -- | -- |
| Region 1 | 22/102 (21.5%) | 51/102 (50.0%) | 42/102 (41.2%) |
| Region 2 | 17/116 (14.7%) | 38/116 (32.8%) | 44/116 (37.9%) |
| Region 3 | 14/106 (13.2%) | 35/106 (33.0%) | 33/106 (31.1%) |

|  | Type 1 | Type 2 | Type 3 |
| -- | -- | -- | -- |
| Region 1 | 28/177 (15.8%) | 79/177 (44.6%) | 70/177 (39.5%) |
| Region 2 | 23/149 (15.4%) | 56/149 (37.6%) | 70/149 (47.0%) |
| Region 3 | 24/131 (18.3%) | 55/131 (42.0%) | 52/131 (39.7%) |

bwiernik commented 3 years ago

The issue is that the first table treats a case with 1 lesion and a case with 10 lesions identically. The model I suggest can make that comparison, but it doesn’t assume they are identical the way a binomial model would.
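A tiny illustration of what dichotomizing discards, using made-up counts for six hypothetical cases (the numbers are not from the tables above):

```r
# Hypothetical lesion counts of one type/location for six cases
lesions <- c(0, 1, 10, 0, 2, 0)

# Case-level view: is any lesion present?
# The case with 1 lesion and the case with 10 lesions become identical.
present <- lesions > 0
prop_cases_with_lesion <- mean(present)

# Count view: the difference between 1 and 10 lesions is retained
total_lesions <- sum(lesions)

prop_cases_with_lesion
total_lesions
```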

delucalab commented 3 years ago

That makes sense! Thank you! One additional question: What are the pros and cons for including the random effect in the zero inflation? Is there a rule of thumb for when you should?

bwiernik commented 3 years ago

Same arguments apply as I laid out for the mean function.

mattansb commented 3 years ago

> The most appropriate form I would argue is your (2)--to model the number of lesions, which may include zero. For this approach, you likely want to choose a family that (1) reflects a count variable, (2) is flexible about the mean and variance for the lesion counts, and (3) also flexibly models the absence of any lesions. For this, I would recommend a zero-inflated negative binomial model

Just to clarify that negative-binomial models can be used to model 0-counts; zero-inflated models model excess zeros - that is, when you have more zeros than are expected from the NB distribution alone (:
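One rough way to see what "excess zeros" means is to compare the observed share of zeros with the share a plain negative binomial would predict. The simulation below is only a sketch with hypothetical parameter values; with real data, the expected share would come from the fitted model's predictions rather than from known parameters.

```r
set.seed(42)
n    <- 10000
mu   <- 2    # hypothetical NB mean
size <- 1    # hypothetical NB dispersion
zi   <- 0.3  # hypothetical probability of a structural zero

# Plain negative binomial counts
nb_counts <- rnbinom(n, size = size, mu = mu)

# Zero-inflated counts: with probability zi the count is forced to 0
structural_zero <- rbinom(n, 1, zi) == 1
zinb_counts <- ifelse(structural_zero, 0, rnbinom(n, size = size, mu = mu))

# Share of zeros the NB distribution alone implies
expected_zero_nb <- dnbinom(0, size = size, mu = mu)

obs_zero_nb   <- mean(nb_counts == 0)    # close to expected_zero_nb
obs_zero_zinb <- mean(zinb_counts == 0)  # clearly above expected_zero_nb
```

If the observed share of zeros is close to what the NB implies, a plain negative binomial may be enough and the zero-inflation component can be dropped.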