[x] extends the tidy data script to parse all event types
[x] creates a train/val/test split strategy script
[x] creates a baseline preprocessing script
[x] produces requested visualizations
Data Pipeline Structure
I structured our data pipeline to flow from raw parsing to splitting to preprocessing (i.e., preparation for training) because:
aggregating the train/val/test split files is expensive, and the splits shouldn't change often
our preprocessing will change often during baseline and advanced modeling work, and will need access to all event types for future preprocessing work
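The three-stage flow can be sketched as below. The function names, schemas, and the `keep_types` filter are illustrative assumptions, not the actual scripts in this repo:

```python
def parse_events(raw_games: list[dict]) -> list[dict]:
    """Stage 1: tidy parsing -- flatten raw games into one row per event,
    keeping all event types so later preprocessing can use any of them."""
    return [
        {"game_id": g["game_id"], "subseason": g["subseason"], **event}
        for g in raw_games
        for event in g["events"]
    ]

def split_events(events: list[dict], train_ids: set, val_ids: set):
    """Stage 2: route events to splits by game assignment.
    Expensive to regenerate, so run rarely."""
    train = [e for e in events if e["game_id"] in train_ids]
    val = [e for e in events if e["game_id"] in val_ids]
    return train, val

def preprocess(events: list[dict], keep_types=("SHOT",)) -> list[dict]:
    """Stage 3: baseline feature prep; expected to change often,
    which is why it sits last in the pipeline."""
    return [e for e in events if e["type"] in keep_types]
```

Because splitting happens before preprocessing, reworking stage 3 never forces a rebuild of the expensive split files.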
Train/Val Split Strategy
I implemented the train/val split as a random 80/20 sample at the game level, stratified by subseason (regular and post). My reasoning is that postseason play may differ slightly from regular-season play, so we want representative proportions of each in both train and val.
Test data is the entire 2019 season as instructed.
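The stratified game-level split described above can be sketched as follows; this is a minimal illustration with an assumed `game_id -> subseason` mapping, not the repo's actual split script:

```python
import random

def game_level_split(games: dict, val_frac: float = 0.2, seed: int = 0):
    """Random 80/20 train/val split at the game level, stratified by subseason.

    `games` maps game_id -> subseason ("regular" or "post"). Sampling is
    done within each subseason so both splits contain representative
    proportions of regular-season and postseason games.
    """
    rng = random.Random(seed)
    by_subseason: dict = {}
    for gid, sub in games.items():
        by_subseason.setdefault(sub, []).append(gid)

    train, val = set(), set()
    for gids in by_subseason.values():
        gids = sorted(gids)       # deterministic base order before shuffling
        rng.shuffle(gids)
        n_val = round(len(gids) * val_frac)
        val.update(gids[:n_val])
        train.update(gids[n_val:])
    return train, val
```

Splitting at the game level (rather than the event level) keeps all events from one game in the same split, which avoids leakage between train and val.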
Notes:
Visualizations associated with this task will come in a separate PR
Maximum split file size is ~400 MB; maximum preprocessed file size is ~60 MB
The instructions are ambiguous as to whether we should include GOAL events in addition to SHOT; I have a question open on Piazza (@219)
Running
To reproduce my results, run the following commands in our conda env: