Because rebalancing is applied to the entire dataset, the code that splits the training and test sets lets variations of test data leak into the training set. It sets aside all synthetic data before the train/test split and then appends all of it to the training set, so synthetic rows derived from rows that land in the test set still end up in training.
https://github.com/byu-dml/d3m-profiler/blob/9a3bc45061267091b0109f2159648785e370a18b/example.py#L63-L75
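To illustrate the concern, here is a minimal sketch on toy data. It does not use the project's actual rebalancing code: `oversample` and `leaked_sources` are hypothetical helpers, and the "synthetic" rows are plain copies tagged with the id of the row they were derived from, which is an assumption made only so the leak can be counted.

```python
import random

def oversample(rows, n_new):
    """Hypothetical stand-in for rebalancing: make n_new copies, round-robin,
    each tagged with the id of the original row it was derived from."""
    return [
        {"id": None, "source_id": rows[i % len(rows)]["id"], "label": rows[i % len(rows)]["label"]}
        for i in range(n_new)
    ]

def leaked_sources(train, test):
    """Ids of test rows that have at least one synthetic variant in train."""
    test_ids = {row["id"] for row in test}
    return {row["source_id"] for row in train if row.get("source_id") in test_ids}

rng = random.Random(0)
originals = [{"id": i, "label": "rare"} for i in range(10)]
shuffled = originals[:]
rng.shuffle(shuffled)
test, train_originals = shuffled[:3], shuffled[3:]

# Pattern described above: rebalance on all data, split only the originals,
# then append every synthetic row to the training set.
synthetic_all = oversample(originals, 30)
leaky = leaked_sources(train_originals + synthetic_all, test)

# For comparison: rebalance only the rows that are already in the training set.
synthetic_train = oversample(train_originals, 30)
safe = leaked_sources(train_originals + synthetic_train, test)

print(len(leaky), len(safe))
```

In this toy setup every test row has a synthetic variant in the leaky training set, while none does when only the training rows are rebalanced.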
Perhaps a more accurate way of splitting and balancing data would be to do it in the following order: