The Problem
Using the entire dataset for training risks data snooping bias: with no held-out data, there is no unbiased way to evaluate the model.
Solution outline
Create a script that splits the data into train and test sets. Since the dataset is large enough (> 100,000 tracks), a purely random split is acceptable.
Ensure that the script always produces the same test set. Otherwise, over repeated runs, the ML algorithm will eventually get to see the whole dataset.
This could be done by computing a hash of each track_id, or by using scikit-learn methods.
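The hash-based option could look roughly like the sketch below. It is only an illustration, not the agreed implementation: the function names, the 20% test ratio, the choice of CRC32 as the hash, and the made-up IDs in the demo are all assumptions, and it works on a plain list of track_id values rather than on the real dataset.

```python
from zlib import crc32


def is_in_test_set(track_id, test_ratio=0.2):
    """Return True if this track_id falls into the test bucket.

    Hypothetical helper: CRC32 is used because it gives the same value on
    every run and every machine, unlike Python's built-in hash() for strings.
    """
    checksum = crc32(str(track_id).encode("utf-8")) & 0xFFFFFFFF
    # Treat the lowest `test_ratio` fraction of the 32-bit hash space as test.
    return checksum < test_ratio * 2**32


def split_track_ids(track_ids, test_ratio=0.2):
    """Split an iterable of track IDs into (train_ids, test_ids)."""
    train_ids, test_ids = [], []
    for track_id in track_ids:
        if is_in_test_set(track_id, test_ratio):
            test_ids.append(track_id)
        else:
            train_ids.append(track_id)
    return train_ids, test_ids


if __name__ == "__main__":
    # Made-up IDs for demonstration only; real track_id values will differ.
    example_ids = [f"TR{n:06d}" for n in range(1000)]
    train_ids, test_ids = split_track_ids(example_ids)
    print(len(train_ids), len(test_ids))
```

Because membership depends only on the track_id itself, a track keeps its assignment even when new tracks are added later. On the scikit-learn side, `train_test_split` with a fixed `random_state` is reproducible as long as the dataset is unchanged, but it may assign tracks differently once rows are added or reordered.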
Acceptance Criteria