MIDASverse / MIDASpy

Python package for missing-data imputation with deep learning
Apache License 2.0

Minimum and maximum value arguments (constraints) #9

Open ThirstyGeo opened 3 years ago

ThirstyGeo commented 3 years ago

I'm working with Dirichlet distributions and the compositional data simplex, and am really enjoying MIDASpy's flexibility with this kind of data (related to the K-L divergence in the decoder). However, the imputations tend to produce negative values in the numerical features I have been using.

In the case of compositional data, there is a hard constraint of zero as a minimum value. Other imputation approaches allow setting minimum and maximum value arguments (e.g., scikit-learn), and importantly these can be set per feature (autoimpute). Could such arguments be added to the package? It would be a major help to people working in several disciplines.
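
For reference, here is a rough sketch of the kind of per-feature interface I mean, using scikit-learn's IterativeImputer (my own illustration on toy data, not MIDASpy code):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (required to use IterativeImputer)
from sklearn.impute import IterativeImputer

# Toy compositional-style data: three non-negative parts with missing entries.
X = np.array([
    [0.2, 0.5, 0.3],
    [0.1, np.nan, 0.6],
    [np.nan, 0.4, 0.2],
    [0.3, 0.3, np.nan],
])

# Per-feature bounds: every part must stay in [0, 1].
imputer = IterativeImputer(
    min_value=np.zeros(X.shape[1]),  # one lower bound per feature
    max_value=np.ones(X.shape[1]),   # one upper bound per feature
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```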

tsrobinson commented 3 years ago

Thanks @ThirstyGeo for raising this issue -- completely agree that it would be a really useful feature. The best way to implement this is probably to allow users to change the activation functions for specific output nodes in the network -- then the model will incorporate this range trimming within training itself.
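
To make that concrete, here is a rough sketch of the idea in plain Keras (not MIDASpy's actual internals, and the column split is hypothetical): non-negative features pass through a softplus head, while the remaining features stay linear.

```python
# Hedged sketch of per-column output activations, not MIDASpy's architecture.
import tensorflow as tf

n_features = 5
nonneg_cols = [0, 1, 2]  # hypothetical: features constrained to be >= 0
free_cols = [3, 4]       # hypothetical: unconstrained features

inputs = tf.keras.Input(shape=(n_features,))
hidden = tf.keras.layers.Dense(32, activation="elu")(inputs)

# Separate output heads so each group of columns gets its own activation.
# (A scaled sigmoid could be used instead to enforce both a min and a max.)
nonneg_out = tf.keras.layers.Dense(len(nonneg_cols), activation="softplus")(hidden)
free_out = tf.keras.layers.Dense(len(free_cols))(hidden)

# Concatenate heads back together; with this column split the order is preserved.
outputs = tf.keras.layers.Concatenate()([nonneg_out, free_out])
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```

Because the constraint lives in the activation function, the network learns to respect the bounds during training rather than having them imposed afterwards.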

We will look into this as a priority, and if you have any further suggestions or pull requests they'd be gratefully received.

ThirstyGeo commented 3 years ago

That's great @tsrobinson! I really appreciate you prioritising this. I'll think a bit more about typical workflows and see if I can put together an example that represents a typical situation. If you like it, it could be something for the package's examples/tutorials.

ThirstyGeo commented 3 years ago

As a tangent of interest: there are few research articles on imputation of data on the compositional data simplex. The best one I'm aware of for deep-learning-oriented imputation of compositional data deals with the specific case of 'censored zeroes', i.e., values that are below analytical detection but above zero (the only information usually given is that the value is below a certain threshold). That article focusses on ANNs and on feature pre-processing, using log-ratio transformations to move the features out of the simplex and into Euclidean space.
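
To sketch what those log-ratio transformations look like (my own NumPy illustration, not anything from that article or from MIDASpy), the centred log-ratio maps a composition into Euclidean space and back:

```python
import numpy as np

def clr(x):
    """Centred log-ratio transform: maps a composition (positive parts)
    from the simplex into Euclidean space. Zero parts need special
    handling (e.g. censored-zero replacement) before taking logs."""
    x = np.asarray(x, dtype=float)
    g = np.exp(np.mean(np.log(x), axis=-1, keepdims=True))  # geometric mean
    return np.log(x / g)

def clr_inverse(z):
    """Map CLR coordinates back onto the unit simplex."""
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

composition = np.array([0.2, 0.5, 0.3])
z = clr(composition)      # impute/model in Euclidean space here
print(clr_inverse(z))     # recovers the original composition
```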

The autoencoder approach of MIDASpy has significant potential advantages: (1) it allows mixed data types, (2) it does not require that pre-processing step, and (3) it produces multiple realisations and therefore a measure of confidence for imputed values. Very exciting!
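
On point (3), a rough illustration of what I mean by a measure of confidence: take the spread of each cell across the completed datasets (generic NumPy/pandas on toy data, nothing MIDASpy-specific):

```python
import numpy as np
import pandas as pd

# Toy stand-in for M completed datasets from a multiple-imputation run
# (in practice these would come from the imputation model, not random noise).
rng = np.random.default_rng(0)
completed = [
    pd.DataFrame(rng.normal(size=(4, 3)), columns=["a", "b", "c"])
    for _ in range(5)
]

stacked = np.stack([df.to_numpy() for df in completed])  # shape (M, rows, cols)
point_estimate = stacked.mean(axis=0)                    # per-cell average
between_sd = stacked.std(axis=0, ddof=1)                 # per-cell spread across imputations

uncertainty = pd.DataFrame(between_sd, columns=completed[0].columns)
print(uncertainty.round(2))
```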

ranjitlall commented 3 years ago

Really interesting - thanks @ThirstyGeo for letting us know about this research.

geraldine28 commented 2 years ago

Hello, and thank you for this great package. I wanted to ask whether there has been any progress on this issue? We have a data set with a lot of count variables, and many of them get imputed with negative values, which isn't ideal. Hence our interest :)
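
In case it's useful to anyone in the meantime, our stopgap has been to post-process the completed datasets, clipping at zero and rounding the count columns (plain pandas, nothing MIDASpy-specific; the column names are made up):

```python
import pandas as pd

count_cols = ["n_visits", "n_purchases"]  # hypothetical count variables

def enforce_counts(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Clip imputed values at zero and round count columns to integers."""
    out = df.copy()
    out[cols] = out[cols].clip(lower=0).round().astype(int)
    return out

# Applied to each completed dataset from the imputation run, e.g.:
# completed = [enforce_counts(df, count_cols) for df in completed]
```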

kblnig commented 1 year ago

Any news on this? Or maybe a small idea of how or where this would fit best in the code, if I were to toy around with it myself? :)

ranjitlall commented 1 year ago

Hi @geraldine28 @kblnig, we are looking into this now and will get back to you shortly. Sorry about the delay!

kblnig commented 1 year ago

@ranjitlall - really looking forward to this :) !!!!

martin18d commented 2 months ago

Echoing others' enthusiasm, I'm also wondering if there's any news on this feature.

AuSpotter commented 2 months ago

Looking forward to this feature!

tsrobinson commented 2 months ago

Thanks everyone for your interest! I can confirm this is now under development, and will update you asap when this functionality is ready for release.