6758-Project / hockey


Data cleaning ("tidy data") does not account for side-switching #14

Closed JakeColor closed 2 years ago

JakeColor commented 2 years ago

Summary

  1. write a download_data style pipeline to filter the raw play data, apply the side-switching correction described below, convert to pandas, and write out CSVs to a tidy/ data directory
  2. write a profiling notebook to make sure the distribution of median/average shot locations by period for all teams makes sense

Description

As I explored the data generated by Sara's original shot_maps/tidy_data work, I realized we haven't properly accounted for how teams switch sides during hockey games.

We also didn't actually apply the tidy data transformation to generate a cleaned dataset (in /data/tidy/), so this is a good opportunity to go back and put in place a cleaner foundation for the visualization work we have ahead of us.

Context

Teams switch sides during hockey games: during the 2nd period, they skate in the opposite direction from the 1st and 3rd periods. This is obvious if we look at median shot coordinates by period:


# Median shot location per team and period, restricted to periods 1-4
median_shot_coordinates = (
    plays[plays['period'] < 5]
    .groupby(['team_name', 'period'])[['coordinate_x', 'coordinate_y']]
    .median()
    .reset_index()
)

team_name        period  coordinate_x  coordinate_y
Anaheim Ducks         1          55.0          -1.0
Anaheim Ducks         2         -56.0           0.5
Anaheim Ducks         3          56.0           0.0
Anaheim Ducks         4         -66.0           0.0
Arizona Coyotes       1          53.0           0.0
Arizona Coyotes       2         -56.0           0.0
Arizona Coyotes       3          54.0           0.0
Arizona Coyotes       4         -64.0           0.0
Boston Bruins         1          51.0           0.0
Boston Bruins         2         -54.0           1.0
Boston Bruins         3          50.0           0.0
Boston Bruins         4         -60.5           1.0

Action

We need to clean our data by transforming event coordinates so that x is always positive when the event takes place in the event-producing team's offensive zone, and negative when it takes place in their defensive zone.

As a first pass, I think we should go rule-based:
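A rough sketch of what such a rule could look like, for illustration only and not necessarily the transformation that was eventually merged. It assumes, based on the medians above, that the shooting team attacks toward +x in odd periods and toward -x in even periods, so negating coordinate_x in even periods puts offensive-zone events at positive x. The function name normalize_x_by_period is hypothetical; the column names come from the snippet above, and the sample rows reuse the Anaheim Ducks medians from the table.

```python
import pandas as pd


def normalize_x_by_period(plays: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where coordinate_x is negated in even-numbered periods.

    Assumption (not confirmed in this thread): the shooting team attacks
    toward +x in odd periods and toward -x in even periods.
    """
    tidy = plays.copy()
    even_period = tidy["period"] % 2 == 0
    # Flip the sign of x only for rows recorded in periods 2, 4, ...
    tidy.loc[even_period, "coordinate_x"] = -tidy.loc[even_period, "coordinate_x"]
    return tidy


# Toy rows: the Anaheim Ducks medians from the table above.
sample = pd.DataFrame(
    {
        "team_name": ["Anaheim Ducks"] * 4,
        "period": [1, 2, 3, 4],
        "coordinate_x": [55.0, -56.0, 56.0, -66.0],
        "coordinate_y": [-1.0, 0.5, 0.0, 0.0],
    }
)
tidy_sample = normalize_x_by_period(sample)
```

This sketch deliberately leaves coordinate_y alone and ignores shootouts and any per-arena quirks.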

Note: It's also possible that some teams' arenas have the "home" side on the opposite side from normal, which we would also have to account for because the logic above would be inverted.

But I'm not 100% sure whether this happens in the NHL, or is already accounted for. If it does happen, it will jump out when we check ourselves after applying the naive rule-based transformation.
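One way that check might be sketched: after the naive transformation, recompute the median coordinate_x per team and period and list any group where it is still negative. The helper name flag_suspect_groups and the toy rows are made up for illustration; only the column names come from the snippet above.

```python
import pandas as pd


def flag_suspect_groups(tidy: pd.DataFrame) -> pd.DataFrame:
    """Return (team_name, period) groups whose median coordinate_x is negative."""
    medians = (
        tidy.groupby(["team_name", "period"])["coordinate_x"]
        .median()
        .reset_index()
    )
    return medians[medians["coordinate_x"] < 0].reset_index(drop=True)


# Made-up rows: "Team B" stands in for a team whose rule came out inverted.
tidy = pd.DataFrame(
    {
        "team_name": ["Team A", "Team A", "Team B", "Team B"],
        "period": [1, 1, 1, 1],
        "coordinate_x": [50.0, 60.0, -52.0, -58.0],
    }
)
suspects = flag_suspect_groups(tidy)
```

An empty result would suggest the rule holds everywhere. If the data has a home-team or arena column (not something shown in this issue), grouping by that instead would point more directly at an inverted arena.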

salelkafrawy commented 2 years ago

Is a shot allowed from the defensive part of the rink? I thought a goal or shot could only be taken from one half of the rink, so I just flipped all the (x, y) coordinates that fall in the other half (e.g. when x is in [-100, 0]). If a shot/goal can be taken from any spot on the rink, then we should go your way and see whether home/away gets flipped at any time.

salelkafrawy commented 2 years ago

and shouldn't the coord_y also be flipped in the case of period%2=0?

JakeColor commented 2 years ago

@saraEbrahim answering your q's:

Is a shot allowed from the defensive part of the rink?

yes

and shouldn't the coord_y also be flipped in the case of period%2=0?

good observation, i agree!
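A small sketch of why y needs to flip along with x: switching ends is a 180-degree rotation of the rink, not a mirror image. Negating only x moves a shot to the correct end of the ice but swaps the shooter's left and right. The helpers below (mirror_x, rotate_180, wing) and the example shot are hypothetical and assume x runs along the length of the rink with the origin at center ice.

```python
from typing import Tuple


def mirror_x(x: float, y: float) -> Tuple[float, float]:
    """Negate x only: a mirror image across the center line."""
    return (-x, y)


def rotate_180(x: float, y: float) -> Tuple[float, float]:
    """Negate x and y: a half-turn about center ice."""
    return (-x, -y)


def wing(y: float, attacking_positive_x: bool) -> str:
    """Side of the ice the shot came from, from the shooter's point of view."""
    if y == 0:
        return "center"
    # Facing +x, the shooter's left is +y; facing -x, it is -y.
    on_left = (y > 0) == attacking_positive_x
    return "left" if on_left else "right"


# Hypothetical period-2 shot by a team attacking toward -x.
shot = (-60.0, 20.0)
original_wing = wing(shot[1], attacking_positive_x=False)

mirrored = mirror_x(*shot)
rotated = rotate_180(*shot)
mirrored_wing = wing(mirrored[1], attacking_positive_x=True)
rotated_wing = wing(rotated[1], attacking_positive_x=True)
```

Here the shot starts on the shooter's right. After rotate_180 it is still on the right, while mirror_x moves it to the left, which would distort any left/right pattern in a shot map.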

JakeColor commented 2 years ago

solved by #13