awslabs / datawig

Imputation of missing values in tables.
Apache License 2.0
478 stars 69 forks

datawig.SimpleImputer.complete not imputing some columns #144

Closed elopezfune closed 3 years ago

elopezfune commented 3 years ago

I am working on a missing-values problem with datawig (I am new to it). Of the 19 features with missing data in my pandas dataframe, 4 are not fully imputed.

I do:

import datawig

# impute missing values
dataframe = datawig.SimpleImputer.complete(dataframe)

and I get the following error message:

/home/user/.local/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

What's happening and how could I impute the rest of the features?

felixbiessmann commented 3 years ago

Looks like this is just a warning, not an error. The code runs through and returns a dataframe, right?

It looks like some values in some columns are very rare. For those classes it's difficult to make high-precision imputations.
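For context on the warning text itself: precision for a label is the number of correct predictions of that label divided by the number of times the label was predicted, so it has no defined value for a label that is never predicted. A minimal sketch in plain Python, using made-up labels (the values below are hypothetical and not taken from the dataset in this issue):

```python
from collections import Counter
from typing import Dict, Hashable, Optional, Sequence


def per_class_precision(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Dict[Hashable, Optional[float]]:
    """Return precision per label, or None where the label was never predicted."""
    predicted = Counter(y_pred)  # how often each label was predicted
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # true positives
    labels = set(y_true) | set(y_pred)
    return {
        label: (correct[label] / predicted[label]) if predicted[label] else None
        for label in labels
    }


# Hypothetical toy column: "C" and "Q" are rare, and the model only ever predicts "S".
y_true = ["S", "S", "S", "C", "S", "Q"]
y_pred = ["S", "S", "S", "S", "S", "S"]

precision = per_class_precision(y_true, y_pred)
for label in sorted(precision):
    print(label, precision[label])
```

Here "C" and "Q" have no predicted samples, so their precision is undefined (None in this sketch). scikit-learn reports 0.0 in that situation by default and emits the UndefinedMetricWarning quoted above.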

To avoid low-precision imputations, I'd recommend setting the precision_threshold argument to a value higher than 0.0, for instance 0.8, when calling complete. With a threshold of 0.8, you could expect a precision of 0.8 for the imputed values.

Values that are still missing after that could not be imputed with high enough precision.
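A minimal sketch of that suggestion, assuming datawig is installed and that the input is a pandas DataFrame. The helper name complete_and_report is hypothetical; only SimpleImputer.complete and its precision_threshold argument come from this thread:

```python
def complete_and_report(df, precision_threshold=0.8):
    """Impute with a precision threshold and report what is still missing.

    Hypothetical helper: `df` is assumed to be a pandas DataFrame.
    """
    # Imported lazily so the helper can be defined without datawig installed.
    import datawig

    completed = datawig.SimpleImputer.complete(
        df, precision_threshold=precision_threshold
    )
    # Per-column count of values that are still missing after imputation.
    remaining = completed.isnull().sum()
    return completed, remaining
```

Called as completed, remaining = complete_and_report(dataframe, 0.8), the second return value lists, per column, how many values were left empty. According to the explanation above, those are the values that could not be imputed with high enough precision.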

Closing this for now, feel free to reopen if more problems come up.

SAMNaqvi1212 commented 2 years ago

I hope this message finds you well. I have been trying to impute missing values in my dataset using the datawig library. It imputes every column except two. Both of those columns are of dtype object, yet datawig does impute the other object columns. I tried your recommendation of increasing precision_threshold to 0.80, but that did not help either. Do you have any recommendation for improving this? Here is the code, along with a view of my dataset (df.tail(155)).

The code to impute the missing values is as follows:

import datawig
df = datawig.SimpleImputer.complete(df, precision_threshold=0.80)

df.isnull().sum()
PassengerId       0
HomePlanet        0
CryoSleep         0
Cabin           199
Destination       0
Age               0
VIP               0
RoomService       0
FoodCourt         0
ShoppingMall      0
Spa               0
VRDeck            0
Name            200
Transported       0
dtype: int64

The missing values in the Cabin and Name columns were left unimputed, and I do not know why. The number of missing values in the Name and Cabin columns was the same before applying datawig imputation as it is afterwards. Any help would be appreciated. Thanks!
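One thing that might be worth checking here is how the two skipped columns differ from the object columns that were imputed, for example whether their values are close to unique per row. The sketch below is only a diagnostic under that assumption and does not establish why datawig left them untouched; the helper name missing_value_report and the toy data are hypothetical:

```python
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per column: dtype, missing count, and fraction of observed values that are unique."""
    rows = []
    for col in df.columns:
        observed = df[col].dropna()
        rows.append(
            {
                "column": col,
                "dtype": str(df[col].dtype),
                "n_missing": int(df[col].isna().sum()),
                "unique_fraction": (
                    observed.nunique() / len(observed) if len(observed) else float("nan")
                ),
            }
        )
    return pd.DataFrame(rows).set_index("column")


# Hypothetical toy frame, not the dataset from this comment.
toy = pd.DataFrame(
    {
        "HomePlanet": ["Earth", "Mars", "Earth", None, "Europa", "Earth"],
        "Name": ["A B", "C D", None, "E F", "G H", "I J"],
    }
)
report = missing_value_report(toy)
print(report)
```

A unique_fraction near 1.0 means almost every observed value is distinct, as one would expect for names or cabin numbers, whereas a column with a few repeated categories has a much lower value.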

ioakeim-h commented 2 years ago

I have exactly the same problem. I installed datawig in my conda environment with Python 3.7 (because higher versions cause problems with mxnet). I downgraded numpy because I got an error after installation: ERROR: mxnet 1.4.0 has requirement numpy<1.15.0,>=1.8.2, but you'll have numpy 1.17.2 which is incompatible.

Next, I tried to impute 3 columns from the Titanic dataset using datawig.SimpleImputer.complete(df, precision_threshold=0.8, inplace=True).

This raised an error: ValueError: fill value must be in categories

So I cast all columns to string type and then converted the resulting "nan" strings to np.nan. When I ran it again, only "Embarked" was imputed.
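The conversion step described above might look like the following sketch. The toy frame is hypothetical: only "Embarked" is named in this comment, and the "Age" and "Cabin" columns are my assumption. With the older pandas releases that support Python 3.7, casting with astype(str) turns missing values into the literal string "nan", which is why they have to be mapped back to np.nan afterwards:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for three Titanic columns with missing values.
toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0],
        "Cabin": [np.nan, "C85", np.nan],
        "Embarked": ["S", "C", np.nan],
    }
)

# Step 1: force every column to string type.
as_text = toy.astype(str)

# Step 2: turn the literal "nan" strings back into real missing values.
cleaned = as_text.replace("nan", np.nan)

print(cleaned.isna().sum())
```

Note that after this step the numeric column holds strings such as "22.0", so it is no longer a numeric dtype.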

I repeated the same steps with precision_threshold = 0.1, and also in Colab, with the same result.

Is this how datawig should work or am I missing something?