Feature - Split the data into test and train - Githubissues

ayahusseini / spotify_success

Analysing Spotify song data to forecast success

MIT License


Feature - Split the data into test and train #8

Closed ayahusseini closed 3 weeks ago

ayahusseini commented 3 weeks ago

The Problem

Using the entire dataset for training risks a data snooping bias, since no unseen data would be left to evaluate the model on.

Solution outline

Create a script that splits the data into test and train sets. Since the dataset is big enough (> 100,000 tracks), we can do this randomly.
Ensure that the same test set is generated on every run (otherwise, over time, the ML algorithm will get to see the whole dataset).
This could be done by computing a hash of each track_id, or by using scikit-learn methods with a fixed random seed.
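
A minimal sketch of the hash-based option, not necessarily what the repo ended up using. It assumes each track_id is a string; the function names `is_in_test_set` and `split_by_hash` are hypothetical, and `zlib.crc32` is just one possible choice of stable hash.

```python
import zlib


def is_in_test_set(track_id: str, test_ratio: float = 0.2) -> bool:
    """Return True if this track should go into the test set."""
    # crc32 gives the same value on every run and machine, unlike the
    # built-in hash(), which is salted per process for strings.
    checksum = zlib.crc32(track_id.encode("utf-8")) & 0xFFFFFFFF
    # Tracks whose checksum falls in the lowest test_ratio share of the
    # 32-bit range are assigned to the test set.
    return checksum < test_ratio * 2**32


def split_by_hash(track_ids, test_ratio: float = 0.2):
    """Split an iterable of track ids into (train_ids, test_ids)."""
    train_ids, test_ids = [], []
    for track_id in track_ids:
        if is_in_test_set(track_id, test_ratio):
            test_ids.append(track_id)
        else:
            train_ids.append(track_id)
    return train_ids, test_ids


if __name__ == "__main__":
    # Made-up ids, for illustration only.
    example_ids = [f"track_{i}" for i in range(1000)]
    train_ids, test_ids = split_by_hash(example_ids)
    print(len(train_ids), len(test_ids))
```

With this approach a track's assignment depends only on its own id, so a track stays in the same set even if rows are added, removed or reordered. The resulting test fraction is only approximately 0.2. The scikit-learn alternative, `train_test_split(data, test_size=0.2, random_state=...)`, is reproducible for a fixed seed, but only as long as the input data and its order do not change.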

Acceptance Criteria

[x] Test-train split with a test ratio of 0.2
[x] Ensure idempotency