MIDASverse / MIDASpy

Python package for missing-data imputation with deep learning
Apache License 2.0

Compatibility with compositional data #10

Open ThirstyGeo opened 3 years ago

ThirstyGeo commented 3 years ago

Sometimes we know that a set of variables should add up to a given total. Measurements involving proportions, percentages, probabilities, and concentrations are compositional data. Such data occur often in household and business surveys, nutritional information for food, population surveys, biological and genetic data, etc.

The complication with compositional data is that the features are inherently mathematically related, leading to spurious correlation coefficients when conventional statistical or ML approaches are applied (e.g., calculating Euclidean distance metrics). However, using K-L distance is potentially a way to avoid this issue, and so MIDAS might offer a nice Deep Learning solution to imputation problems involving compositional data.
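To illustrate the spurious-correlation problem, here is a minimal sketch (not from the issue; the Dirichlet-generated compositions are a stand-in for real data). Because each row is constrained to sum to one, the raw parts are negatively correlated by construction. A standard remedy in compositional data analysis is the centred log-ratio (CLR) transform, which maps the simplex into unconstrained real space before computing distances or correlations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-part compositions: each row sums to 1 (the "closure" constraint).
parts = rng.dirichlet(alpha=[2.0, 3.0, 5.0], size=500)

# Raw proportions are negatively correlated purely because of the constraint.
raw_corr = np.corrcoef(parts, rowvar=False)

def clr(x):
    """Centred log-ratio transform: log of each part minus the row-wise mean log."""
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

# CLR-transformed data live in real space; each transformed row sums to zero.
z = clr(parts)
```

One design note: distances between CLR-transformed rows correspond to the Aitchison geometry of the simplex, which is the usual way to make Euclidean-style methods meaningful for compositional data.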

However, in some preliminary experiments using classic compositional data imputation datasets, MIDASpy hasn't performed as well as I expected, and I was wondering if you'd be able to comment?

For example, I imposed 30% missingness at random on the 'Kola soil horizon' geochemical dataset and compared the known and imputed values against each other. You can see a marked linear trend in the imputed values.
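The evaluation protocol described above can be sketched as follows. This is a hypothetical reconstruction, not the original code: the column names are placeholders for the real Kola variables, and column-mean imputation stands in for whichever imputer is under test.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Stand-in for a geochemical table (hypothetical columns, not the real Kola data).
complete = pd.DataFrame(rng.lognormal(size=(200, 4)),
                        columns=["SiO2", "Al2O3", "Fe2O3", "MgO"])

# Impose 30% missingness completely at random, keeping the mask for evaluation.
mask = rng.random(complete.shape) < 0.30
incomplete = complete.mask(mask)

# Placeholder imputer: column means. Swap in MIDASpy (or any other method) here.
imputed = incomplete.fillna(incomplete.mean())

# Compare known vs. imputed values only at the cells that were made missing.
truth = complete.values[mask]
preds = imputed.values[mask]
rmse = np.sqrt(np.mean((preds - truth) ** 2))
```

Plotting `truth` against `preds` is how one would see the linear trend described above.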

If you are interested to take a look, here is a recent paper which references the Kola datasets, along with a copy of the data: Paper and two datasets

tsrobinson commented 3 years ago

Hi @ThirstyGeo, thanks for sharing this test. It's very interesting, and I agree it would be good if the MIDAS model could somehow account for this type of dependence between observations. I don't know exactly why you would observe linear trends here -- do things change when you adjust the size/shape of the MIDAS layers, and does including a variational autoencoder layer (`vae_layer = True`) help?

I'll leave this issue open -- if you/others have any ideas how we might adjust the network to account for this type of data, then please feel free to share comments/pull requests.