TaddyLab / hockey

Chicago Hockey Analytics

corsi and fenwick #4

Closed mataddy closed 9 years ago

mataddy commented 9 years ago

@rbgramacy and @sentian, some food for thought on how we sell this.

From the article at http://www.secondcityhockey.com/2013/12/4/5167404/nhl-stats-made-simple-part-1-corsi-fenwick: both corsi and fenwick are based just on shots. The difference is that corsi includes blocked shots.

so when we are doing our regression with shots, is the result like a "regression adjusted fenwick"? And could we do the same thing for corsi by adding blocked shots?

mataddy commented 9 years ago

@sentian: since you've done much more with shots... if you have a chance, can you confirm that the 'shot' event that we are using in design.R conforms with the fenwick definition listed above? ie does it include both on-goal and missed shots? And also what is the code for a blocked shot?

sentian commented 9 years ago

Will do tomorrow. I've been quite busy today.

mataddy commented 9 years ago

No rush at all! Also, feel free to add whatever you want to the analysis. Your and my job is to get creative with analysis while bobby writes. For example, I think you did a bunch of salary comparisons that I haven't replicated.

sentian commented 9 years ago

@mataddy, @rbgramacy: do we have a document describing the event types 'etype' in the 'gamerec' files? I've found that etype has 'MISS' and 'BLOCK' besides 'SHOT' and 'GOAL'. To check that, I compared the record with a game on YouTube, which makes me believe 'BLOCK' is the blocked shots in Corsi and 'SHOT' is the missed shots. However, I'm not too sure what 'MISS' is.

An example is as follows.

The game record shows that in period 1, there's a 'SHOT' by Malkin at 19:18, 'MISS' by Crosby at 17:12, 'BLOCK' (shot by Letang) at 14:55.

The video I've found is here. https://www.youtube.com/watch?v=tQ32LL7UT7Y

If you guys agree with this, I'm adding in 'blocked shots' in the design.R, and do some analysis using Corsi, maybe?

mataddy commented 9 years ago

thanks, nice work. From the video it appears that SHOT is a 'shot on goal' and MISS is a 'missed shot'. Notice that on Crosby's 'MISS' he is knocked off balance and hits the boards far from the goal.

So then, instead of my 'SHOTS' flag we should have something like RESPONSE equal to either 'goal' for just etype=="GOAL"; or 'fenwick' for etype %in% c("GOAL","SHOT","MISS"); or 'corsi' for etype %in% c("GOAL","SHOT","MISS","BLOCK"). Then we can replace the current *-shot.csv results with both *-fenwick and *-corsi.
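A minimal sketch of the RESPONSE mapping described here, assuming the events sit in a plain character vector. The toy `etype` values (including 'HIT'), the `response_events` list, and the `keep_events` helper are hypothetical names for illustration and are not taken from design.R.

```r
# Hypothetical toy event types; the real values come from the gamerec files.
etype <- c("GOAL", "SHOT", "MISS", "BLOCK", "HIT", "SHOT")

# Event types counted under each response, as listed in the comment above.
response_events <- list(
  goal    = "GOAL",
  fenwick = c("GOAL", "SHOT", "MISS"),
  corsi   = c("GOAL", "SHOT", "MISS", "BLOCK")
)

# Logical flag: TRUE for events that count toward the chosen response.
keep_events <- function(etype, response) {
  stopifnot(response %in% names(response_events))
  etype %in% response_events[[response]]
}

# Number of toy events retained under each response definition.
counts <- sapply(names(response_events),
                 function(r) sum(keep_events(etype, r)))
print(counts)
```

Because fenwick is a subset of corsi, the corsi design always has at least as many rows as the fenwick one.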

Note that in the output performance-*.csv file (regardless of metric) I've added 'fp' which is the 'for percentage': points.for/(points.for + points.away). This is the way that corsi and fenwick tend to be reported, as opposed to the usual PM, points.for - points.away. As a probabilistic version of this I added 'prob' based on our betas, which is just prob = 1/(1+exp(-beta)).
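The two quantities can be sketched as follows. The function and argument names (`for_pct`, `beta_to_prob`, `points_for`, `points_against`) are illustrative assumptions, not the names used in performance.R, and the input numbers are made up.

```r
# 'For percentage': share of events credited to the player's own side.
for_pct <- function(points_for, points_against) {
  points_for / (points_for + points_against)
}

# Logistic transform of a player effect; equivalent to stats::plogis(beta).
beta_to_prob <- function(beta) {
  1 / (1 + exp(-beta))
}

# Made-up inputs for illustration only.
fp   <- for_pct(points_for = 60, points_against = 40)
prob <- beta_to_prob(c(-0.5, 0, 0.5))
print(fp)
print(prob)
```

A beta of zero maps to prob = 0.5, which lines up with an fp of 0.5 for a player whose side gets exactly half of the events.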

mataddy commented 9 years ago

oops, didn't mean to close the issue.

Also, Sen FYI both a 'shot on goal' and a 'missed shot' are 'misses' in the sense that no-one scores, but the 'shot on goal' required a stop from the goalie while the 'missed shot' had no chance.

sentian commented 9 years ago

I forgot I should write it here instead of sending email. @mataddy Can you please send me the 1314 game record data ‘20132014-*-gamerec.txt’? I just realized I don't have them. I think it should be around 16 megabytes. Probably a dropbox link?

mataddy commented 9 years ago

cool, will do. I have the full directory with all games back to 02/03 tar'd in dropbox and can share that; ~1GB so not too big, then we all have the same data.

sentian commented 9 years ago

That's perfect, thx~

mataddy commented 9 years ago

just sent; let me know if you don't see it.

sentian commented 9 years ago

hmm. I think I did not receive it. Did you send to my gmail?

sentian commented 9 years ago

I got the game records. Can you send me the roster data as well? I forgot to mention it. Apologies...

sentian commented 9 years ago

I've just finished running the design code for CORSI and FENWICK. My laptop has 8 cores and it took about 8 hours to run each. They are both very large. The nhldesign-*.rda files are around 450 MB each.

Several things I want to check with @mataddy :

  1. In design.R, to get the entry and first year entering the NHL, the code you gave is

     entry <- goal$season[XP@i[tail(XP@p,-1)+1]+1]
     firstyear <- as.numeric(substr(entry,1,4))

     Should it be 'head' instead of 'tail'? Either way they shouldn't cause trouble, since I have not seen them used in the performance/salary code.

  2. Still in design.R, the 'player' data frame has an item 'plus.minus' which is 'colSums(XP)'. So for each player that is actually, while he was on the ice, # of home goals - # of away goals. That's not the usual plus.minus we talk about, right?

mataddy commented 9 years ago

Hi Sen, I think you're correct on both of those; these are legacy bugs from the old buildgoals. But since we use none of them downstream, let's just delete them (I recalculate plus minus correctly in performance.R).
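For the record, a small sketch of why 'head' is the right choice, assuming XP is a column-compressed sparse matrix from the Matrix package (events in rows, players in columns) and that every player column has at least one nonzero entry. The toy `season` vector and matrix below are invented for illustration.

```r
library(Matrix)

# Hypothetical toy data: 5 events (rows) by 3 players (columns).
season <- c("20022003", "20022003", "20032004", "20032004", "20042005")
XP <- sparseMatrix(
  i = c(1, 3, 2, 4, 3, 5),
  j = c(1, 1, 2, 2, 3, 3),
  x = c(1, -1, 1, 1, -1, 1),
  dims = c(5, 3)
)

# XP@p holds 0-based offsets of each column's first entry in XP@i, plus a
# final element equal to the number of nonzeros. Dropping that last element
# with head() points at the first stored row of every column.
first_row <- XP@i[head(XP@p, -1) + 1] + 1
entry     <- season[first_row]
firstyear <- as.numeric(substr(entry, 1, 4))

# tail() instead points at the first entry of the *next* column, and runs
# off the end of XP@i for the last column, giving NA.
shifted <- XP@i[tail(XP@p, -1) + 1] + 1

print(firstyear)
print(shifted)
```

With 'tail', each player is assigned the entry season of the following column's player, and the last player gets NA.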

can you also update performance.R and run for these new responses? Then we'll have results/performance_corsi.csv, etc

sentian commented 9 years ago

Sure. That's what I'm doing right now.

sentian commented 9 years ago

I've uploaded the results. Maybe we should delete the SHOTS results? @mataddy I've also shared the nhldesign files for CORSI and FENWICK in case you want to do any other analysis.

CORSI data: n>1.3 million, FENWICK data: n>1 million.

A few findings:

  1. Results measured in FENWICK are very similar to results measured in SHOTS.
  2. Daniel Sedin has the highest partial pm when measured using CORSI. Alex Ovechkin is still the king when measured in FENWICK.
  3. We get nonzero post-season beta differences for both CORSI and FENWICK.
  4. Salary correlations: CORSI gives several negative values in seasons 0506 and 0607 (right after the lockout season).

mataddy commented 9 years ago

super nice work sen; thanks. I'll delete the 'shots' results (and we've recorded here that results are close to those from fenwick) and close this issue.

I'll also create a new issue for an initial writeup of these models and results. I'm going to assign it to you but it is my responsibility too. I won't be able to devote time to this until the middle of next week, so anything you can get in there yourself will give me a good head start for the final writeup.