BorisMuzellec / MissingDataOT

A PyTorch implementation of missing data imputation using optimal transport.

Some strange warnings #2

Closed: philipperemy closed this issue 3 years ago

philipperemy commented 3 years ago
python experiment.py
/Users/premy/PycharmProjects/MissingDataOT/venv/lib/python3.8/site-packages/sklearn/impute/_iterative.py:685: ConvergenceWarning: [IterativeImputer] Early stopping criterion not reached.
  warnings.warn("[IterativeImputer] Early stopping criterion not"
2021-02-26 18:19:56,316 mean imputation:    MAE: 0.8266    RMSE: 0.9814    OT: 0.5446
2021-02-26 18:19:56,318 ice imputation:     MAE: 0.4502    RMSE: 0.6515    OT: 0.1419
2021-02-26 18:19:57,793 softimpute:         MAE: 0.4773    RMSE: 0.6481    OT: 0.1897
2021-02-26 18:19:57,796 epsilon: 0.1092 (50.0th percentile times 0.05)
2021-02-26 18:19:57,796 Sinkhorn Imputation
2021-02-26 18:19:57,796 Batchsize larger that half size = 75. Setting batchsize to 64.
2021-02-26 18:19:57,797 batchsize = 64, epsilon = 0.1092
2021-02-26 18:19:57,824 Iteration 0:     Loss: 0.2670    Validation MAE: 0.8336 RMSE: 0.9894
2021-02-26 18:20:09,027 Iteration 500:   Loss: 0.0983    Validation MAE: 0.4873 RMSE: 0.6853
2021-02-26 18:20:20,780 Iteration 1000:  Loss: 0.0989    Validation MAE: 0.4797 RMSE: 0.6918
2021-02-26 18:20:33,981 Iteration 1500:  Loss: 0.1439    Validation MAE: 0.4630 RMSE: 0.6870

Should I worry?

BorisMuzellec commented 3 years ago

Hi, nothing to worry about!

There is an early stopping criterion that checks how much the imputations have changed at each iteration (in L2 norm) and stops the algorithm if the change is below a given threshold. However, sampling the batches introduces quite a lot of variance, so this criterion is rarely met in practice (unless you sample a very large number of batch pairs at each iteration). The warning just tells you that the algorithm stopped because the maximum number of iterations was reached, rather than because of this early stopping criterion.
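For concreteness, here is a minimal sketch of what such a criterion looks like (not the repo's actual code; `step_fn`, `max_iter`, and `tol` are hypothetical names): the loop exits early only if the imputed matrix barely moves between iterations, and otherwise runs until `max_iter`, which is exactly the situation the warning reports.

```python
import torch

def impute_with_early_stopping(X, step_fn, max_iter=2000, tol=1e-3):
    """Illustrative sketch: stop when the imputations move less than
    `tol` in L2 norm between iterations. With noisy batch sampling this
    rarely triggers, so the loop usually runs the full max_iter."""
    X_prev = X.clone()
    for _ in range(max_iter):
        X = step_fn(X)                    # one (noisy) batched update step
        if torch.norm(X - X_prev) < tol:  # early stopping criterion met
            return X
        X_prev = X.clone()
    # max_iter exhausted without meeting the criterion:
    # this is what the convergence warning reports
    return X
```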

To decide when to stop the algorithm, the best approach is to create a validation set (e.g. by artificially introducing new missing values) and monitor the decrease in MAE and RMSE on that validation set.
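As a rough illustration of that validation setup (the helper names and boolean-mask layout below are assumptions, not the repo's API): hide a small fraction of the observed entries, impute, and score MAE/RMSE only on those hidden entries.

```python
import torch

def make_validation_mask(X, miss_mask, frac=0.1, seed=0):
    """Hide a fraction of the currently observed entries as a validation set.
    `miss_mask` is True where values are already missing (assumed layout)."""
    gen = torch.Generator().manual_seed(seed)
    observed = ~miss_mask
    return observed & (torch.rand(X.shape, generator=gen) < frac)

def validation_errors(X_true, X_imputed, val_mask):
    """MAE and RMSE restricted to the artificially hidden entries."""
    diff = (X_imputed - X_true)[val_mask]
    return diff.abs().mean().item(), diff.pow(2).mean().sqrt().item()
```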

philipperemy commented 3 years ago

@BorisMuzellec thank you for the quick answer!