[x] extends the tidy data script to parse all event types
[x] creates a train/val/test split strategy script
[x] creates a baseline preprocessing script
[x] produces requested visualizations
Data Pipeline Structure
I structured our data pipeline to flow from raw parsing to splitting to preprocessing (i.e., preparation for training) because:
aggregating the train/val/test split files is expensive, and the splits shouldn't change often
our preprocessing will change often during baseline and advanced modeling work, and will need access to all event types for future preprocessing work
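The three-stage flow can be sketched as below. The function names, schemas, and the `keep_types` filter are illustrative assumptions, not the actual scripts in this repo:

```python
def parse_events(raw_games: list[dict]) -> list[dict]:
    """Stage 1: tidy parsing -- flatten raw games into one row per event,
    keeping all event types so later preprocessing can use any of them."""
    return [
        {"game_id": g["game_id"], "subseason": g["subseason"], **event}
        for g in raw_games
        for event in g["events"]
    ]

def split_events(events: list[dict], train_ids: set, val_ids: set):
    """Stage 2: route events to splits by game assignment.
    Expensive to regenerate, so run rarely."""
    train = [e for e in events if e["game_id"] in train_ids]
    val = [e for e in events if e["game_id"] in val_ids]
    return train, val

def preprocess(events: list[dict], keep_types=("SHOT",)) -> list[dict]:
    """Stage 3: baseline feature prep; expected to change often,
    which is why it sits last in the pipeline."""
    return [e for e in events if e["type"] in keep_types]
```

Because splitting happens before preprocessing, reworking stage 3 never forces a rebuild of the expensive split files.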
Train/Val Split Strategy
I implemented the train/val split as a random 80/20 sample at the game level, stratified by subseason (regular and post). My reasoning is that postseason play may differ slightly from regular-season play, so we want representative proportions of each in both train and val.
Test data is the entire 2019 season as instructed.
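The stratified game-level split described above can be sketched as follows; this is a minimal illustration with an assumed `game_id -> subseason` mapping, not the repo's actual split script:

```python
import random

def game_level_split(games: dict, val_frac: float = 0.2, seed: int = 0):
    """Random 80/20 train/val split at the game level, stratified by subseason.

    `games` maps game_id -> subseason ("regular" or "post"). Sampling is
    done within each subseason so both splits contain representative
    proportions of regular-season and postseason games.
    """
    rng = random.Random(seed)
    by_subseason: dict = {}
    for gid, sub in games.items():
        by_subseason.setdefault(sub, []).append(gid)

    train, val = set(), set()
    for gids in by_subseason.values():
        gids = sorted(gids)       # deterministic base order before shuffling
        rng.shuffle(gids)
        n_val = round(len(gids) * val_frac)
        val.update(gids[:n_val])
        train.update(gids[n_val:])
    return train, val
```

Splitting at the game level (rather than the event level) keeps all events from one game in the same split, which avoids leakage between train and val.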
Notes:
Visualizations associated with this task will come in a separate PR
Maximum split file size is ~400 MB; maximum preprocessed file size is ~60 MB
The instructions are ambiguous as to whether we should include GOAL events in addition to SHOT; I have a question open on Piazza (@219)
Running
To reproduce my results, run the following commands in our conda env: