eltonlaw / impyute

Data imputations library to preprocess datasets with missing data
http://impyute.readthedocs.io/
MIT License

[DDFG] Complete MNAR missingness generation #63

Open mm-abogdan opened 5 years ago

mm-abogdan commented 5 years ago

Complete the mnar method in the Corruptor class.

Put simply, MNAR (Missing Not at Random) is a type of missingness in which the probability that a value is missing depends, in whole or in part, on unobserved data. The missingness may also depend on observed data at the same time.
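To make the definition concrete, here is a minimal sketch of one common MNAR pattern, in which a value's chance of being hidden rises with the value itself, so the missingness depends on data that ends up unobserved. The function name self_masking_mnar and the steepness parameter are hypothetical and are not part of impyute.

```python
import numpy as np


def self_masking_mnar(data, rng, steepness=2.0):
    """Hypothetical helper (not impyute API): hide larger values more often."""
    data = np.asarray(data, dtype=float)
    # Standardize each column so the logistic curve below is scale-free.
    std = data.std(axis=0)
    std[std == 0] = 1.0
    z = (data - data.mean(axis=0)) / std
    # Probability of being missing grows with the (soon unobserved) value.
    p_missing = 1.0 / (1.0 + np.exp(-steepness * z))
    mask = rng.random(data.shape) < p_missing
    corrupted = data.copy()
    corrupted[mask] = np.nan
    return corrupted, mask


rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))
corrupted, mask = self_masking_mnar(data, rng)
```

Because the hidden values are systematically larger than the ones that remain, the cause of the missingness cannot be recovered from the observed entries alone.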

Implementation: Generate a random selection of new features and base the missingness on them. The number of features to generate may be some fraction of the number of existing features, or a random number between 1 and n_features. These features could (should?) be a mix of continuous and categorical; the mix could follow the fraction of each feature type among the existing features. Once the features are generated, impose missingness based on them.
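The steps above could be sketched roughly as follows. This is only one possible reading of the proposal, not the existing Corruptor code: the function name mnar_via_latent_features, its parameters (latent_fraction, categorical_cols, missing_rate), the three-level categorical features, and the random linear scoring are all assumptions made for illustration.

```python
import numpy as np


def mnar_via_latent_features(data, rng, latent_fraction=0.5,
                             categorical_cols=(), missing_rate=0.2):
    """Hypothetical sketch: mask entries based on generated, discarded features."""
    data = np.asarray(data, dtype=float)
    n_rows, n_features = data.shape

    # Number of new features: a fraction of the existing ones, at least 1.
    n_latent = max(1, int(round(latent_fraction * n_features)))
    # Mirror the share of categorical columns in the existing data.
    n_cat = int(round(n_latent * len(categorical_cols) / n_features))
    n_cont = n_latent - n_cat

    # Generate the unobserved features (assumed: normal and 3-level categorical).
    continuous = rng.normal(size=(n_rows, n_cont))
    categorical = rng.integers(0, 3, size=(n_rows, n_cat)).astype(float)
    latent = np.hstack([continuous, categorical])

    # Score every cell as a random linear combination of the latent features.
    weights = rng.normal(size=(n_latent, n_features))
    score = latent @ weights

    # Mask the highest-scoring cells in each column.
    threshold = np.quantile(score, 1.0 - missing_rate, axis=0)
    mask = score > threshold

    corrupted = data.copy()
    corrupted[mask] = np.nan
    # The latent features are never returned, so they stay unobserved.
    return corrupted


rng = np.random.default_rng(42)
data = rng.normal(size=(200, 4))
corrupted = mnar_via_latent_features(data, rng, categorical_cols=(3,))
```

The function takes a 2-D matrix and returns a matrix of the same shape, leaving the input untouched. How the categorical columns are identified is left to the caller here; the real method may need a different convention.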

Be sure that functions accept and return matrices. Be sure to follow the 4 steps outlined in contributing.md.

The labels below are for DDFG (Data Days for Good) participant reference: Priority: High; Difficulty: Medium.

https://github.com/eltonlaw/impyute/blob/2c25368576558374d385293f65c883a91dff5027/impyute/dataset/corrupt.py#L48-L50