biolab / orange3

🍊 Orange: Interactive data analysis
https://orangedatamining.com

Data leakage? #4240

Closed gengyabc closed 4 years ago

gengyabc commented 4 years ago

Learners automatically impute missing values. I think this may cause data leakage from the training set to the test set if we do not split the data beforehand but instead use cross-validation or random sampling in the "Test and Score" widget.

https://github.com/biolab/orange3/blob/b3c5fdf3615173ac81146ab632f55ee9cc1726a7/Orange/base.py#L113

The missing values are imputed before the data is split into training and test sets, which is done in the "Test and Score" widget.

janezd commented 4 years ago

There is no cross-validation after that line -- a few lines later the entire data set is used for fitting. The Test and Score widget cannot do anything after that line. Splitting occurs earlier, and preprocessing is applied only to the training data: https://github.com/biolab/orange3/blob/master/Orange/evaluation/testing.py#L432.
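To make the order of operations concrete, here is a minimal, library-free sketch of the idea (split first, then fit the imputer on the training rows of each fold only). It is not Orange's actual code; the helper names `fit_mean`, `impute`, and `folds`, as well as the toy data, are made up for illustration, and mean imputation is just an assumed stand-in for whatever preprocessing a learner applies.

```python
from statistics import mean


def fit_mean(column):
    """'Fit' a mean imputer: return the mean of the known (non-None) values."""
    return mean([v for v in column if v is not None])


def impute(column, fill):
    """Replace missing values (None) with the previously fitted fill value."""
    return [fill if v is None else v for v in column]


def folds(n, k):
    """Yield (train_idx, test_idx) for k contiguous folds.

    Assumes n is divisible by k to keep the sketch short.
    """
    size = n // k
    for i in range(k):
        test = list(range(i * size, (i + 1) * size))
        train = [j for j in range(n) if j not in test]
        yield train, test


# Hypothetical single-feature data with two missing values.
values = [1.0, None, 3.0, 100.0, None, 5.0]

# Leaky variant: the imputer is fitted on ALL rows before any split,
# so every test fold is filled using statistics that include test rows.
leaky_fill = fit_mean(values)

# Leak-free variant: split first, then fit the imputer on the training
# rows of each fold and apply that fitted value to the held-out rows.
fold_fills = []
imputed_tests = []
for train_idx, test_idx in folds(len(values), k=3):
    fill = fit_mean([values[j] for j in train_idx])
    fold_fills.append(fill)
    imputed_tests.append(impute([values[j] for j in test_idx], fill))
```

In the leak-free loop the fill value differs from fold to fold because it never sees the held-out rows, whereas `leaky_fill` is a single value computed from the whole data set.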

Of course you have to construct a correct schema:

Screenshot 2019-12-02 at 13 19 01

If you do the following, it's wrong, but the problem is not in the line you mention.

Screenshot 2019-12-02 at 13 20 16