awslabs / datawig

Imputation of missing values in tables.
Apache License 2.0

The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). #98

Closed TaichiLi closed 4 years ago

TaichiLi commented 5 years ago

When predicting missing values, I found that datawig can't impute multiple columns at once. For example:

data_encoder_cols = [NumericalEncoder('a'), NumericalEncoder('c'),
                     NumericalEncoder('e'), NumericalEncoder('g'), NumericalEncoder('h')]
label_encoder_cols = [NumericalEncoder('b'), NumericalEncoder('d'), NumericalEncoder('f')]
data_featurizer_cols = [NumericalFeaturizer('a'), NumericalFeaturizer('c'), NumericalFeaturizer('e'),
                        NumericalFeaturizer('g'), NumericalFeaturizer('h')]

imputer = Imputer(
    data_featurizers=data_featurizer_cols,
    label_encoders=label_encoder_cols,
    data_encoders=data_encoder_cols,
    output_path='imputer_model1'
)

This is my code. I want to impute the columns 'b', 'd' and 'f', but I get an error:

Traceback (most recent call last):

  File "<ipython-input-42-15a8b8acfb65>", line 1, in <module>
    runfile('E:/Python/datawig-master/1.py', wdir='E:/Python/datawig-master')

  File "D:\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 704, in runfile
    execfile(filename, namespace)

  File "D:\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "E:/Python/datawig-master/1.py", line 32, in <module>
    imputer.fit(train_df=df_train,num_epochs=10)

  File "E:\Python\datawig-master\datawig\imputer.py", line 257, in fit
    iter_train, iter_test = self.__build_iterators(train_df, test_df, test_split)

  File "E:\Python\datawig-master\datawig\imputer.py", line 564, in __build_iterators
    train_df = self.__drop_missing_labels(train_df, how='all')

  File "E:\Python\datawig-master\datawig\imputer.py", line 935, in __drop_missing_labels
    if missing_idx == -1:

  File "D:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1469, in __nonzero__
    .format(self.__class__.__name__))

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I don't know how to solve it. I would appreciate some help.

felixbiessmann commented 5 years ago

You can try to train and apply an Imputer for each output column separately.
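
A minimal sketch of the one-Imputer-per-column approach, reusing the column names and encoder classes from the code above. The helper names `output_path_for` and `fit_one_imputer_per_label` and the per-column output directories are hypothetical, and the datawig import paths are assumed to match the installed version:

```python
INPUT_COLUMNS = ['a', 'c', 'e', 'g', 'h']
LABEL_COLUMNS = ['b', 'd', 'f']


def output_path_for(label):
    # Hypothetical naming scheme: one model directory per label column.
    return 'imputer_model_' + label


def fit_one_imputer_per_label(df_train, num_epochs=10):
    # Imported lazily so this file still loads where datawig is not installed.
    # Import paths are an assumption; adjust them to your datawig version.
    from datawig.imputer import Imputer
    from datawig.column_encoders import NumericalEncoder
    from datawig.mxnet_input_symbols import NumericalFeaturizer

    imputers = {}
    for label in LABEL_COLUMNS:
        imputer = Imputer(
            data_featurizers=[NumericalFeaturizer(c) for c in INPUT_COLUMNS],
            # Exactly one label encoder per Imputer, instead of three at once.
            label_encoders=[NumericalEncoder(label)],
            data_encoders=[NumericalEncoder(c) for c in INPUT_COLUMNS],
            output_path=output_path_for(label),
        )
        imputer.fit(train_df=df_train, num_epochs=num_epochs)
        imputers[label] = imputer
    return imputers
```

Each fitted model could then be applied on its own, for example `fit_one_imputer_per_label(df_train)['b'].predict(df_test)`, where `df_test` is a placeholder for the data to impute.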

Alternatively, if you are OK with the default settings used in the SimpleImputer, you can also try the convenience function SimpleImputer.complete:

df = SimpleImputer.complete(data_frame=df)

Does that help?

TaichiLi commented 5 years ago

@felixbiessmann I trained an Imputer for each column, because I don't think the SimpleImputer performs better than the Imputer. I also have another question: how do I run it on a GPU?

felixbiessmann commented 5 years ago

Hm, the code you shared should be using exactly the same featurizers, model and hyperparameters as the ones used by SimpleImputer.complete, so it would be most interesting if it gives you better results. The only difference is that the SimpleImputer is less typing. Would you mind sharing a comparison of the precision you're getting with either approach?
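
One possible way to make such a comparison concrete for numerical columns is to hide some known values, impute them with both approaches, and compare an error metric. The sketch below uses mean absolute error, which is an assumption (the thread does not name a metric), and all numbers are made-up placeholders rather than real results:

```python
def mean_absolute_error(true_values, imputed_values):
    # Average absolute difference between held-out truth and imputations.
    if len(true_values) != len(imputed_values):
        raise ValueError("inputs must have the same length")
    if not true_values:
        raise ValueError("need at least one held-out value")
    total = sum(abs(t - p) for t, p in zip(true_values, imputed_values))
    return total / len(true_values)


# Hypothetical held-out values for column 'b' and two sets of imputations.
held_out_b = [1.0, 2.0, 3.0, 4.0]
from_imputer = [1.5, 2.0, 2.5, 4.0]
from_simple_imputer = [1.0, 2.5, 3.0, 5.0]

mae_imputer = mean_absolute_error(held_out_b, from_imputer)
mae_simple = mean_absolute_error(held_out_b, from_simple_imputer)
print("Imputer MAE:", mae_imputer)
print("SimpleImputer MAE:", mae_simple)
```

If both approaches really share featurizers, model and hyperparameters, the two errors should come out close to each other on real data.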

As for the GPU setup: You'll need an mxnet installation that works with GPUs. For a start you could try following the instructions on the main readme page of datawig.

If you run this (in your activated virtualenv)

wget https://raw.githubusercontent.com/awslabs/datawig/master/requirements/requirements.gpu-cu${CUDA_VERSION}.txt
pip install datawig --no-deps -r requirements.gpu-cu${CUDA_VERSION}.txt
rm requirements.gpu-cu${CUDA_VERSION}.txt

you should have the required dependency.

datawig should then use the available GPUs by default.

TaichiLi commented 5 years ago

Hmm, wget is for Linux, but I use Win7, and there is no virtualenv on my computer. In addition, the SimpleImputer doesn't accept a data_encoders parameter.

felixbiessmann commented 5 years ago

You can also download the requirement files for GPU directly: https://github.com/awslabs/datawig/tree/master/requirements

You can install those requirements without a virtualenv as well.
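
Where wget is not available, the same three steps (download, install, clean up) can be sketched with the Python standard library. The function names are hypothetical, and the CUDA version string passed in must correspond to a file that actually exists in the requirements directory:

```python
import os
import subprocess
import sys
import urllib.request

BASE_URL = "https://raw.githubusercontent.com/awslabs/datawig/master/requirements/"


def requirements_filename(cuda_version):
    # Same naming as in the wget command: requirements.gpu-cu${CUDA_VERSION}.txt
    return "requirements.gpu-cu{}.txt".format(cuda_version)


def requirements_url(cuda_version):
    return BASE_URL + requirements_filename(cuda_version)


def install_gpu_requirements(cuda_version):
    filename = requirements_filename(cuda_version)
    # Download the requirements file (replaces wget).
    urllib.request.urlretrieve(requirements_url(cuda_version), filename)
    try:
        # Run pip through the current interpreter, so no virtualenv is needed.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "datawig",
             "--no-deps", "-r", filename]
        )
    finally:
        # Remove the downloaded file (replaces rm).
        os.remove(filename)
```

Calling `install_gpu_requirements("90")` would then target a CUDA 9.0 setup; `"90"` is only a placeholder here, so check the requirements directory for the version that matches your installation.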

Are you trying to run GPU based model training on a Win7 machine? Not sure how good an idea that is.

As for the SimpleImputer comparison, you don't need to specify those data_encoders. SimpleImputer sets them up for you, exactly as you did in the code you shared.

Just type

df = SimpleImputer.complete(data_frame=df)

TaichiLi commented 5 years ago

Thanks very much. I tried SimpleImputer, but I think that if we used an autoencoder, we might get more accurate results.

felixbiessmann commented 5 years ago

Hm, possibly. But the code you shared doesn't use an autoencoder. In fact, none of the DataWig models use autoencoders, as far as I know.

On 22. Mar 2019, at 13:01, TaichiLi notifications@github.com wrote:

Thanks very much. I tried SimpleImputer, but I think that if we used an autoencoder, we might get more accurate results.


TaichiLi commented 5 years ago

I looked at many packages that can impute missing values. Although datawig isn't perfect, it's the best one. I found some code that uses an autoencoder to impute missing values, but I don't know how to run it on my data. This is the link