byu-dml / d3m-profiler

MIT License

Train/test splits in `example.py` script have test data bleeding into training set #7

Closed e13h closed 3 years ago

e13h commented 4 years ago

Because rebalancing happens on all the data, the code that splits the training and test sets lets variations of test data leak into the training set. It sets aside all synthetic data before splitting into train/test sets, then appends all of that synthetic data to the training set, including synthetic examples generated from organic rows that ended up in the test set.

https://github.com/byu-dml/d3m-profiler/blob/9a3bc45061267091b0109f2159648785e370a18b/example.py#L63-L75

Perhaps a more accurate way of splitting and balancing data would be to do it in the following order:

  1. Split organic data into train/test sets
  2. Balance training data
  3. Train model and get predictions
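
The ordering above could be sketched roughly as follows. This is only an illustration, not the code in `example.py`: the helper name `split_then_balance`, its parameters, and the toy data are hypothetical, and plain random oversampling (duplicating minority-class rows) stands in for whatever synthetic-data generation the project actually uses. The point is that balancing only ever sees training rows.

```python
import random
from collections import Counter, defaultdict


def split_then_balance(labels, test_fraction=0.25, seed=0):
    """Hypothetical helper: split organic rows first, then balance only the training rows.

    Returns (train_indices, test_indices) into the organic data.
    """
    rng = random.Random(seed)

    # Step 1: split the organic data into train/test sets.
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Step 2: balance the training data only.
    by_label = defaultdict(list)
    for i in train_idx:
        by_label[labels[i]].append(i)
    target = max(len(members) for members in by_label.values())

    balanced_idx = []
    for members in by_label.values():
        balanced_idx.extend(members)
        # Assumption: resampling with replacement stands in for synthetic generation.
        balanced_idx.extend(rng.choices(members, k=target - len(members)))

    return balanced_idx, test_idx


# Toy, made-up labels: 30 rows of one class and 10 of another.
labels = ["integer"] * 30 + ["categorical"] * 10
train_idx, test_idx = split_then_balance(labels)

# Step 3 would train on train_idx and predict on test_idx.
print(Counter(labels[i] for i in train_idx))
```

Because every extra training row is derived from a row already in the training split, no variation of a test row can end up in the training set.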

e13h commented 3 years ago

Closed because of staleness