JenniNiku / gllvm

Generalized Linear Latent Variable Models
https://jenniniku.github.io/gllvm/
48 stars 20 forks source link

NAs in data #34

Closed adhamab closed 1 year ago

adhamab commented 3 years ago

Hi there,

for some species in some rows (plots) I have NAs (where the species count was not recorded). It does not appear possible to include rows with missing data in the model matrix.

Is there a way around this?

Best regards, Adham

BertvanderVeen commented 3 years ago

Technically, it would be possible to handle missing values in the data matrix, by estimating the parameters using the observed. Even better, would be to integrate over missing data (imputation), I suppose.

Neither approach is currently supported, and it would take some doing to implement. So, my answer is no: I don't think so, but @JenniNiku might have a different answer. I think the quick-and-dirty way of accounting for this is to fix NAs e.g. to the mean or median of the observed values, but I have no idea what the consequence of that on your results would be.

adhamab commented 3 years ago

Thank you! Your quick and dirty way was my possible backup solution, so I will test that.

gerverska commented 1 year ago

Thanks @adhamab for the question--I'm still wondering about this too. It seems like helpful functionality for fitting hurdle models.

BertvanderVeen commented 1 year ago

I'll think about addressing this. Out of curiosity, what is the cause of the NA?

BertvanderVeen commented 1 year ago

OK, I have added some modifications on my update branch here: https://github.com/BertvanderVeen/gllvm/tree/gllvmUpdate3 that should allow you to have some NAs in your response data. It is not (yet?) possible to have NAs in the predictor matrices. However, I would like to note that very few software allows for missingness in the predictors; it is generally difficult to properly deal with.

If you want, you can have a try and let me know if it works. In essence, for a NA observation in the response the likelihood evaluation is skipped. Then, the corresponding value for that cell is predicted with the information from the non-missing observations and has corresponding uncertainty. As an aside, the missing value is of course not represented in residual plots, as there is no data to validate the prediction against.

gerverska commented 1 year ago

@BertvanderVeen Wow, thanks Bert! I'll try this out today, ASAP!

I'm still considering some of the implications, but the NAs that I'm encountering are the result of a hurdle-modeling approach that I'm trying out. I've tried this out with the HMSC R package (detailed in the associated book, Joint Species Distribution Modelling, With Applications in R), and the approach they advocate for involves a hurdle model where a presence-absence matrix is modeled separately from an abundance matrix with zeroes set to NA.

Part of the reason they advocate this was due to a lack of zero-inflated distribution options (unlike gllvm), so a hurdle model approach might not be necessary for gllvm. However, these ZI distribution options seem to mostly cater to taxon count data, but my taxon responses are actually positive continuous (with many zeroes), so it seems like the Tweedie, gamma, exponential, and Gaussian families will be more appropriate, and it's my rudimentary impression that these don't play nicely with a lot of zeroes...thus my pursuit of a hurdle model with gllvm (whew, long response, sorry).

gerverska commented 1 year ago

I'd love to get everyone's impression about an approach like this. Full disclosure, I'm working on a microbiome sequencing dataset, and I know y'all have gotten a lot of inquiry regarding "compositionality" in a very helpful discussion thread, where some of the work of Greg Gloor was linked.

Long story short (again?), I've applied one of the data transformations he recommends (the additive log ratio), which seems to make non-zero abundance data more amenable to description by the distributional families I mentioned (the response data are now on a continuous scale), thus my perceived need of a hurdle model to deal with the presence-absence and abundance portions of my dataset separately (assuming that Tweedie, gamma, exponential, and/or Gaussian distributions wouldn't converge too well with all of the zeroes).

I know that the discussion had some other recommendations for addressing compositionality, but it seems like the recommendations made (i.e., defining row effects) seemed more geared to address the differences in library size (i.e., sequencing/sampling depth) between samples, rather than addressing some of the points brought up by Gloor. I think it's not helpful that microbiome folks have had these siloed discussions about compositionality, and I think the term "compositionality" sensu Gloor + Aitchison is not incredibly intuitive--according to them, the data are compositional because of the fact that number of sequences obtained for each sample is arbitrarily fixed and then split among the taxa occurring in each sample.

The fact that the number of sequences varies arbitrarily from sample to sample seems like it can be addressed with the row effects approach described by @dwarton, but the fact that the number of sequences obtained is capped/fixed creates the main issues associated with compositionality sensu Gloor--because of the fixed sequencing depth, microbiome folks are already stuck in a situation where they can't really talk about shifts in the underlying actual abundances of taxa between sample unless the read counts of each taxon in each sample are divided by a fixed (or approximately fixed) feature in each sample.

I'll probably cross post into the discussion I linked, but it seems helpful to give some context here.

BertvanderVeen commented 1 year ago

We might want to start a separate discussion on this, or continue in the compositionality thread rather than here, so that others can find it in the future. Today we (Jenni) opened the discussions forum in the gllvm github in order to better separate discussion from bugs or issues in the package, so could you post this in the existing compositionality thread there?

Generally, my attitude towards data transformations is; avoid them if you can. I think they tend to mess many things up, inference most of all. One of the great benefits of GLMs (and gllvm by extension) is that we have all these distributions that can facilitate the properties of data at hand. As I said in that thread; don't adjust the data to fit your model, but adjust the model to fit your data(-generating process).

BertvanderVeen commented 1 year ago

@gerverska see David's answer here

smithja16 commented 1 year ago

I for one am grateful that NAs are now allowed in response. I'm also using a hurdle model to model biomass (Probit and Gamma), mainly because this seems more stable (in terms of fitting) than a Tweedie.