6758-Project / hockey


M2 Feature Engineering I: Data Augmentation & Train/Val/Test Split #33

Closed JakeColor closed 2 years ago

JakeColor commented 2 years ago

This PR:

Data Pipeline Structure

I structured our data pipeline to flow from raw parsing to splitting to preprocessing (i.e. preparation for training) because:

Train/Val Split Strategy

I implemented the train/val split as a random 80/20 sample at the game level, stratified by subseason (regular and post). My reasoning is that postseason play may differ slightly from regular-season play, so we want both represented proportionally in train and val.

Test data is the entire 2019 season as instructed.
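
For reviewers, here is a minimal sketch of the split logic described above. It is an illustration under assumptions, not the code in split_data_for_training.py: the function name split_games and the record keys game_id, season and subseason are hypothetical, and it uses only the standard library.

```python
import random
from collections import defaultdict


def split_games(games, val_frac=0.2, test_season=2019, seed=0):
    """Split game records into train/val/test sets of game ids.

    `games` is an iterable of dicts with the hypothetical keys
    "game_id", "season" and "subseason" (e.g. "regular" or "post").
    """
    rng = random.Random(seed)
    test_ids = set()
    by_subseason = defaultdict(list)

    for game in games:
        if game["season"] == test_season:
            # The whole test season is held out, regardless of subseason.
            test_ids.add(game["game_id"])
        else:
            by_subseason[game["subseason"]].append(game["game_id"])

    train_ids, val_ids = set(), set()
    # Sample within each subseason so both are represented proportionally.
    for subseason in sorted(by_subseason):
        ids = sorted(set(by_subseason[subseason]))
        rng.shuffle(ids)
        n_val = round(len(ids) * val_frac)
        val_ids.update(ids[:n_val])
        train_ids.update(ids[n_val:])

    return train_ids, val_ids, test_ids
```

Splitting on game ids rather than on individual shots keeps every event from a given game on the same side of the split, which is the point of sampling at the game level.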

Notes:

Running

To reproduce my results, run the following commands in our conda env:

python src/data/download_data.py --seasons 2015
python src/data/tidy_data.py
python src/data/split_data_for_training.py
python src/data/process_data_for_training.py 
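
If it is useful, the four commands above could also be chained from a small driver so that a failing stage stops the run. This is only a sketch: the names PIPELINE and run_pipeline are hypothetical and not part of this PR, and the script paths and the --seasons flag are copied verbatim from the commands above.

```python
import subprocess
import sys

# Stage scripts in the order listed above; arguments are copied verbatim.
PIPELINE = [
    ["src/data/download_data.py", "--seasons", "2015"],
    ["src/data/tidy_data.py"],
    ["src/data/split_data_for_training.py"],
    ["src/data/process_data_for_training.py"],
]


def run_pipeline(stages=PIPELINE, runner=subprocess.run):
    """Run each stage with the current interpreter, in order."""
    for stage in stages:
        # check=True raises CalledProcessError if a stage exits non-zero,
        # so later stages never run on incomplete data.
        runner([sys.executable, *stage], check=True)
```

Calling run_pipeline() from the repo root inside the conda env would run the stages in sequence; the runner parameter exists only so the ordering can be checked without executing anything.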

Viz

shots-by-distance