awslabs / datawig

Imputation of missing values in tables.
Apache License 2.0
478 stars 69 forks source link

The dataset you're doing the examples on, doesn't have missing values #74

Closed wmlba closed 5 years ago

wmlba commented 5 years ago

I might be missing something but I couldn't find any missing values on the dataset you're doing the example on.

felixbiessmann commented 5 years ago

That's correct, there are no missing values in the example data sets.

That's because in order to evaluate the imputation methods, we are doing cross-validation.

Cross-validation means we split a table into non-overlapping subsets of rows and then train an imputation model on one part of the rows and test it on another part of the rows. If we would run the tests with actual missing values, we would not be able to evaluate the model.

You're right in that in practice one would use the model on rows in which there are missing values. Does that make sense?

wmlba commented 5 years ago

@felixbiessmann Thanks for the reply. Does it mean that if I have a dataframe with missing values, i should train the model on the split of the non-missing values then use the model to predict/impute the missing ones?

Or should I feed it the whole dataset (Dataframe) and It will train on the observations that doesn't have missing values?

It would be great if you can include a more practical example of a DS with missing values.

felixbiessmann commented 5 years ago

Sorry for the confusion, you're absolutely right, we have to work on the examples.

At training time you don't need to worry about whether or not the data frame contains missing values, the training procedure will use the observed values only as labels, rows in which the to-be-imputed column has missing values will be discarded.

ayokha commented 5 years ago

The model looks like it doesn't support dataFrames with missing values. It worked with a csv table without missing values but once I run it with the df with missing values, it trains and then returns : ValueError: Classification metrics can't handle a mix of unknown and multiclass targets

Please help !! I work on a 2 columns csv table et trying to impute the missing values on the second column

felixbiessmann commented 5 years ago

Could you please provide an example or description that would allow to reproduce the error? It sounds like your data frame has missing values encoded in a way that cannot be dealt with by data wig.

On 6. Oct 2019, at 14:38, ayokha notifications@github.com wrote:

The model looks like it doesn't support dataFrames with missing values. It worked with a csv table without missing values but once I run it with the df with missing values, it trains and then returns : ValueError: Classification metrics can't handle a mix of unknown and multiclass targets

Please help !! I work on a 2 columns csv table et trying to impute the missing values on the second column

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.