awslabs / datawig

Imputation of missing values in tables.
Apache License 2.0

Question) Getting Imputation Weight #141

Closed hyojin0912 closed 4 years ago

hyojin0912 commented 4 years ago

Thanks for your nice package.

I have one question.

I am imputing a large matrix (90,000 rows by 7,000 columns).

This matrix contains a lot of missing values (over 80% NA).

It also includes both numerical values and binary (0/1) categorical values.

Below is my code (after loading the whole dataframe to impute):

```python
import datawig

with tf.device(d):
    df = datawig.SimpleImputer.complete(df, inplace=True, num_epochs=max_epoch, verbose=1,
                                        output_path=result_dir + str(num_seed) + 'seed_imputer_model')
    with open(result_dir + str(num_seed) + "seed_Imputed_merged_cid.pickle", 'wb') as handle:
        pickle.dump(merged_cid, handle, protocol=pickle.HIGHEST_PROTOCOL)

    pd.DataFrame(df).to_csv(result_dir + str(num_seed) + 'seed_Imputed_merged_cid.csv', index=None)
```

I use "datawig.SimpleImputer.complete" for simplicity,

but is there any method to get the neural network weights used for the imputation? Also, how does "datawig.SimpleImputer.complete" split the data between training and validation?

I am asking because the accuracy does not increase:

```
2020-10-27 11:14:22,355 [INFO] Epoch[49] Batch [0-34] Speed: 1651.71 samples/sec cross-entropy=0.515578 C0040436-accuracy=0.000000
2020-10-27 11:14:22,675 [INFO] Epoch[49] Train-cross-entropy=0.667427
2020-10-27 11:14:22,675 [INFO] Epoch[49] Train-C0040436-accuracy=0.000000
2020-10-27 11:14:22,676 [INFO] Epoch[49] Time cost=0.657
2020-10-27 11:14:22,688 [INFO] Saved checkpoint to "result/dtip/impute/datawig/1000seed_imputer_model/C0040436/model-0049.params"
2020-10-27 11:14:22,723 [INFO] Epoch[49] Validation-cross-entropy=0.492388
2020-10-27 11:14:22,723 [INFO] Epoch[49] Validation-C0040436-accuracy=0.000000
```

Thanks

Hyojin

felixbiessmann commented 4 years ago

Hi

I'm not sure I fully understand the problem, but here are a couple of remarks:

I would strongly recommend checking the metrics of your model; imputation results should be treated with care when the metrics indicate accuracy as low as in your case. The predict function of SimpleImputer has a precision_threshold parameter for categorical values that ensures you only get high-precision imputations.
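To illustrate the idea behind a precision threshold (this is just the concept, not datawig's internal implementation): given per-cell predictions with model confidences, only the predictions above the threshold are kept as imputations, and the rest stay missing. The rows, labels, and probabilities below are made-up example data.

```python
# Each tuple: (row index, predicted label, model confidence).
preds = [(0, "A", 0.95), (1, "B", 0.51), (2, "A", 0.88), (3, "B", 0.49)]

threshold = 0.8  # keep only high-confidence imputations

# Rows below the threshold are left as missing rather than filled in.
confident = [(i, label) for i, label, p in preds if p >= threshold]
print(confident)  # [(0, 'A'), (2, 'A')]
```

With a high threshold you trade coverage (fewer cells imputed) for precision (the imputed cells are more likely to be correct), which matters when the overall accuracy is as low as in the logs above.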

Hope this helps - feel free to reopen otherwise

hyojin0912 commented 4 years ago

Thanks for your kind reply.

But I still have a few things to ask:

  1. Typo in my previous message: I meant the accuracy does not increase. I can't understand why the accuracy doesn't increase even though the loss decreases.
  2. How should I set "precision_threshold" in my case? (Related to 1.)
  3. Is there any recommendation for parallelizing MXNet? I tried TensorFlow's device placement as the backend, with the code below:

```python
for d in ['/gpu:2', '/gpu:3', '/gpu:4', '/gpu:5', '/gpu:6', '/gpu:7']:
    with tf.device(d):
        ...
```

I ask because the run has been stuck in the state below for more than a day. There must be an error:

```
2020-10-27 20:38:31,079 [INFO] Saved checkpoint to "result/dtip/impute/datawig/1000seed_imputer_model/C0344329/model-0036.params"
2020-10-27 20:38:31,136 [INFO] No improvement detected for 20 epochs compared to 1.0773332220560405 last error obtained: 5.240848921006545, stopping her
2020-10-27 20:38:31,136 [INFO] ========== done (33.13334774971008 s) fit model
```

I uploaded the merged_cid.csv that I used in the code above as merged_cid (= df for "SimpleImputer.complete"):

merged_cid.zip

Thanks

felixbiessmann commented 4 years ago

The cross-entropy can still change when the accuracy doesn't; the cross-entropy is just a finer-grained loss.
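A minimal numeric illustration of this point (with made-up probabilities, not datawig output): a binary classifier can assign the true class a higher probability each epoch, so the cross-entropy drops, while it still picks the wrong label on every example, so accuracy stays at zero.

```python
import math

def cross_entropy(p_true):
    # Average negative log-likelihood of the true class.
    return -sum(math.log(p) for p in p_true) / len(p_true)

def accuracy(p_true):
    # In the binary case, the prediction is correct only when the
    # true class gets more than half the probability mass.
    return sum(p > 0.5 for p in p_true) / len(p_true)

# Early epoch: model gives the true class 20% probability on every example.
# Later epoch: it improves to 40%, but still picks the wrong class each time.
early, late = [0.2, 0.2, 0.2], [0.4, 0.4, 0.4]

print(cross_entropy(early) > cross_entropy(late))  # True: loss decreased
print(accuracy(early) == accuracy(late) == 0.0)    # True: accuracy still zero
```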

The precision threshold is a standard parameter of SimpleImputer.predict.

MXNet and TensorFlow are usually not combined; you pick one or the other.

hyojin0912 commented 4 years ago

Thank you for fast reply.

I understand everything now.

Then, do you have any guess about my zero accuracy, given that my matrix contains so many NAs?

felixbiessmann commented 4 years ago

hm, i'd probably use the SimpleImputer.fit/predict approach on single columns (like complete does internally, but writing the for loop over the columns yourself, because complete deletes the metrics/log dir immediately) and then check the metrics files to see which columns can actually be predicted well enough.
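A sketch of that per-column loop on a toy dataframe, with a simple mode-per-group predictor standing in for datawig's SimpleImputer.fit/predict (the column names and data here are invented; in the real loop you would fit one SimpleImputer per output column and inspect its saved metrics instead of computing accuracy by hand):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data: 'predictable' is fully determined by 'feat'; 'noise' is random.
n = 200
feat = rng.choice(["x", "y"], size=n)
df = pd.DataFrame({
    "feat": feat,
    "predictable": np.where(feat == "x", "A", "B"),
    "noise": rng.choice(["A", "B"], size=n),
})
# Knock out roughly half of each target column, mimicking a sparse matrix.
for col in ["predictable", "noise"]:
    df.loc[rng.random(n) < 0.5, col] = np.nan

results = {}
for col in ["predictable", "noise"]:
    # Only rows where the target is observed can be used for evaluation.
    observed = df[df[col].notna()]
    train = observed.sample(frac=0.8, random_state=0)
    test = observed.drop(train.index)
    # Stand-in for imputer.fit(train) / imputer.predict(test):
    # predict the most frequent label of the column within each 'feat' group.
    mode_by_feat = train.groupby("feat")[col].agg(lambda s: s.mode()[0])
    preds = test["feat"].map(mode_by_feat)
    results[col] = (preds == test[col]).mean()
    print(col, "held-out accuracy:", round(results[col], 2))
```

Columns like 'noise' score near chance level and are candidates for dropping (or for imputing only with a high precision_threshold), while columns like 'predictable' are worth imputing.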