BurntSushi / nfldb

A library to manage and update NFL data in a relational database.
The Unlicense

offense_yds aggregate double counts passing/receiving yds #85

Open hvivian opened 9 years ago

hvivian commented 9 years ago

According to the nfldb wiki, offense_yds is computed by summing

"nfldb.PlayPlayer.passing_yds, nfldb.PlayPlayer.rushing_yds, nfldb.PlayPlayer.receiving_yds and nfldb.PlayPlayer.fumbles_rec_yds".

However, on a completed pass the passer's passing yards equal the receiver's receiving yards, so counting both inflates total yardage when aggregating PlayPlayers across multiple Players (for example, when counting a team's total yardage over an entire game).

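import nfldb

db = nfldb.connect()
q = nfldb.Query(db)
# Restrict to CIN's home game in week 16 of the 2014 regular season,
# and to plays where CIN has possession.
q.game(season_year=2014, season_type='Regular', week=16)
q.game(home_team='CIN').play(pos_team='CIN')

# Aggregated PlayPlayer statistics, one row per player.
agg = q.as_aggregate()

# offense_yds includes both passing_yds and receiving_yds.
total_yds = sum([play.offense_yds for play in agg])
# Leave out passing_yds so each completion is counted once.
total_yds_true = sum([play.rushing_yds + play.receiving_yds + play.fumbles_rec_yds for play in agg])

print total_yds
print total_yds_true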

Results in:

499
353

The ESPN box score agrees that the second result is accurate.
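The effect can be reproduced without a database. The following is a minimal sketch using made-up per-player rows for one 12-yard completion and one 5-yard run; the field names mirror the PlayPlayer statistics quoted above, while the rows and the helper functions `offense_yds` and `team_yds` are hypothetical and are not part of nfldb.

```python
# Hypothetical per-player rows: one 12-yard completion (QB to WR) and one
# 5-yard run (RB). The values are invented for illustration only.
rows = [
    {"player": "QB", "passing_yds": 12, "rushing_yds": 0, "receiving_yds": 0, "fumbles_rec_yds": 0},
    {"player": "WR", "passing_yds": 0, "rushing_yds": 0, "receiving_yds": 12, "fumbles_rec_yds": 0},
    {"player": "RB", "passing_yds": 0, "rushing_yds": 5, "receiving_yds": 0, "fumbles_rec_yds": 0},
]


def offense_yds(row):
    # Sum of the four fields the wiki lists for offense_yds.
    return (row["passing_yds"] + row["rushing_yds"]
            + row["receiving_yds"] + row["fumbles_rec_yds"])


def team_yds(row):
    # Same sum without passing_yds, so a completion is counted once.
    return row["rushing_yds"] + row["receiving_yds"] + row["fumbles_rec_yds"]


summed_offense = sum(offense_yds(r) for r in rows)
summed_team = sum(team_yds(r) for r in rows)

print(summed_offense)  # 29: the 12-yard completion is counted twice
print(summed_team)     # 17: 12 receiving + 5 rushing
```

The gap between the two sums is exactly the passing yardage, which is consistent with the 499 versus 353 difference reported above being CIN's passing yards for that game.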

ochawkeye commented 9 years ago

Aggregating individual player statistics isn't matching your expectation here, but I'm not sure how this one might be addressed other than documenting how the aggregated data could/should be used. offense_yds is derived from PlayPlayer statistics and as such wouldn't be the best candidate for calculating Play statistics.

You are correct that total offense_yds breaks down when you aggregate over multiple players, but any change along the lines of what you propose breaks aggregation over a single player, which, in my opinion, is the derived statistic's primary use.

import nfldb

db = nfldb.connect()
q = nfldb.Query(db)
q.game(season_year=2014, season_type='Regular', week=16)
q.game(home_team='CIN').play(pos_team='CIN')
q.player(full_name='Andy Dalton')
agg = q.as_aggregate()

total_yds = sum([play.offense_yds for play in agg])
total_yds_true = sum([play.rushing_yds + play.receiving_yds + play.fumbles_rec_yds for play in agg])

print total_yds
print total_yds_true

Results in:

171
25

In week 3 of 2014 (a different game from the week 16 query above), Andy Dalton had 169 yards passing, 3 yards rushing, and 18 yards receiving. The only way to arrive at 190 total yards for that game is to use all of the statistics that currently go into offense_yds.
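The single-player case can be checked with the numbers quoted for that week 3 game. This is only a sketch: the thread does not mention fumbles_rec_yds, so it is assumed to be 0 here, and the dictionary is a stand-in for an aggregated PlayPlayer row, not an nfldb object.

```python
# Andy Dalton's week 3, 2014 line as quoted in the thread.
dalton = {
    "passing_yds": 169,
    "rushing_yds": 3,
    "receiving_yds": 18,
    "fumbles_rec_yds": 0,  # assumption: not stated in the thread
}

# All four fields, as offense_yds is described in the wiki.
all_fields = sum(dalton.values())

# The team-level formula from the original report, which omits passing_yds.
without_passing = (dalton["rushing_yds"] + dalton["receiving_yds"]
                   + dalton["fumbles_rec_yds"])

print(all_fields)       # 190
print(without_passing)  # 21
```

For one player, dropping passing_yds discards most of a quarterback's production, which is why the existing definition suits single-player aggregation.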

I think if you are harvesting full team stats for a game, then you might have to sum the play yards yourself rather than rely upon individual player statistics to add up to the number you are looking for.
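One way to do that summing, sketched here on plain dictionaries rather than nfldb objects, is to group rows by team and leave out passing_yds so each completion is counted once through receiving_yds. The `team` key, the sample rows, and the `team_totals` helper are all hypothetical; how the rows would be extracted from nfldb is not shown.

```python
from collections import defaultdict

# Hypothetical rows: one dict per (play, player), with made-up values.
rows = [
    {"team": "CIN", "passing_yds": 20, "rushing_yds": 0, "receiving_yds": 0},
    {"team": "CIN", "passing_yds": 0, "rushing_yds": 0, "receiving_yds": 20},
    {"team": "CIN", "passing_yds": 0, "rushing_yds": 7, "receiving_yds": 0},
    {"team": "DEN", "passing_yds": 0, "rushing_yds": 4, "receiving_yds": 0},
]


def team_totals(rows):
    """Total yards per team, counting each completion once via receiving_yds."""
    totals = defaultdict(int)
    for row in rows:
        # Missing fields (for example fumbles_rec_yds) are treated as 0.
        totals[row["team"]] += (row.get("rushing_yds", 0)
                                + row.get("receiving_yds", 0)
                                + row.get("fumbles_rec_yds", 0))
    return dict(totals)


totals = team_totals(rows)
print(totals["CIN"])  # 27
print(totals["DEN"])  # 4
```

This keeps offense_yds untouched for single-player queries and moves the team-level definition into the caller's own code.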

hvivian commented 9 years ago

Makes sense, thanks for clearing that up. I hadn't really considered that use case.