atahk / pscl

Political Science Computational Laboratory
64 stars 14 forks source link

zeroinfl with continuous data #5

Closed Anna1288 closed 4 years ago

Anna1288 commented 5 years ago

I am running into an issue where zeroinfl will not work because my data is not in integer form. I'm a little confused because I thought Poisson models could be used with continuous/rate data. Is there any way to get around this to be able to do a zero-inflated Poisson GLM?

atahk commented 5 years ago

Poisson regression really make sense for continuous/rate data. The model assumes the responses follow a Poisson distribution, which can only take on integer values. While you can work around this by treating the pmf of the Poisson as if it could be evaluated at non-integer values, this doesn't make a lot of theoretical sense and can lead to a lot of problems. The problems is particularly acute in the context of a zero-inflated model.

In most cases where the response is represents a rate, I'd suggest a different model. But that depends on the type of the response. If the response is truly continuous and can take on any non-negative value, models such as gamma regression (i.e., a GLM with a gamma-distributed response) often makes sense. But it doesn't make sense to mix these with zero inflation, either, because the probability of seeing a zero that's not from the inflation stage is zero. Hence, any zeros are from the inflation stage (at least with probability one) and you can just run two separate models. I'm assuming that's not the case here.

The usual case where the outcome is non-integer and represents a rate and one might want to apply a zero-inflated model is where the outcome is a count divided by the observation time or something of that sort (e.g., the size of the population at risk). For example, an outcome variable of this sort might be the rate at which individuals with a certain condition are hospitalized, defined as the number of times a person was hospitalized divided by the length of time they had the condition, or the rate at which a rare disease occurs in different regions, defined by the number of cases per region divided by the population. Here, we really do have a non-zero chance of observing a rate of zero.

If you're trying to model something like this, there's an easy solution: use a zero-inflated Poisson model with an exposure variable (this can also be applied with a negative. That would mean that the outcome would be coded as the count (rather than the rate) but the denominator in computing the rate (length of time, population size, etc.) enters as the exposure variable. The outcome is still discrete, so a Poisson distribution makes sense, but the exposure value scales the rate parameter. So, when the exposure is doubled, the model predicts that the expected count (conditional on making it past the inflation stage) is also doubled, and so on. But, because the outcome is still discrete, there's no problem combining this with an inflation stage.

You add the exposure variable to your Poisson regression (or other count model with a log link function, such as negative binomial regression) by adding an offset for the log of the exposure variable. You can do this with the zeroinfl command in pscl by adding the argument offset=log(exposure.variable) to the command or by adding + offset(log(exposure.variable)) to the formula where exposure.variable is, of course, the name of the relevant variable. This also works with Poisson regression using the standard glm command in R and with negative binomial regression using the glm.nb command in MASS (except that glm.nb command lacks an offset argument, so you can only do it by adding + offset(log(exposure.variable)) to the formula).

Let me know if that answers your question. If not, perhaps it would help if you could describe your data a bit more.

oushujun commented 3 years ago

@atahk Thank you for your detailed explanation and guidelines to overcome the 0 rate issue. I think my data fall into your description with observations during exposures, but the exposure is not exactly what it looks like in a general sense, so please let me provide more background.

I study a type of genetic element called LTR retrotransposon which is a virus inside plants (aka. retrovirus). As a virus, LTR retrotransposons can amplify and integrate into the genome, resulting in more intact elements for further amplification, and can also be removed by the genome as a defense. The removal will result in a small remanent called solo LTR. The solo/intact ratio denotes how fast the virus was removed by the host. With more intact LTR retrotransposons (exposure), you get a higher chance to observe more solo LTRs.

To study the removal rate (solo/intact ratio) between different plants, here I denote the count of solo LTRs as observations and the count of intact elements as exposures. The problem is, in some situations (not rare), intact elements can be all removed and have 0 counts while solo is non-0.

If there are 0s in the exposure (i.e., intact), then the log transformation will fail. Can you briefly explain the purpose of a log transformation here and how 0s may be compensated?

If I assume 0s in exposure are erroneous and have them removed, I can get the offset = log(exposure.variable) running, but in zeroinfl, adding offset to the formula or using the offset parameter produced different results as measured by dispersion. Can you comment on this?

M4.1 <- zeroinfl(solo ~ log(intact) + genome | genome, dist = 'negbin', data = All_count_sub[All_count_sub$intact>0,]) E2 <- resid(M4.1, type = "pearson") N <- nrow(All_count_sub) p <- length(coef(M4.1))
sum(E2^2) / (N - p) # 1.31, small overdispersion

M4.2 <- zeroinfl(solo ~ genome | genome, offset = log(intact), dist = 'negbin', data = All_count_sub[All_count_sub$intact>0,]) E2 <- resid(M4.2, type = "pearson") N <- nrow(All_count_sub) p <- length(coef(M4.2))
sum(E2^2) / (N - p) # 1.66, some overdispersion

All_count_sub.RData.zip

I also include the data frame I am using for you to further learn about the dataset. Please unzip it and use load() to load it to R. Thanks for helping with this.

Best, Shujun