Closed bayesrules closed 10 years ago
Great question! There isn't anything in the API that will directly give you aggregated team statistics, but that's because it's very easy to do. For example, rushing yards for the Patriots in week 17 last season:
import nfldb
db = nfldb.connect()
q = nfldb.Query(db)
q.game(season_year=2013, season_type='Regular', week=17, team='NE')
# An important line! Without it, you'll get `rushing_yds` from both teams.
q.play_player(team='NE')
pps = q.as_aggregate()
print sum(pp.rushing_yds for pp in pps)
Of course, this isn't as efficient as doing it in SQL.
If you have any ideas on how this might be added to the API proper, I might be open to them. (However, I do like the simplicity of the API now.)
Additionally, I'd be curious to know the stadium a given game is played in.
This isn't part of the source data unfortunately. However, I think it may exist in other places. For example: http://www.nflgsis.com/2013/Reg/17/56084/Gamebook.xml --- Look in that first GamebookSummary
tag for the Stadium
attribute. There's a lot of other interesting data there too, like the weather and referee names.
You can trivially construct those URLs using the gamekey
attribute on a nfldb.Game
object.
There's a lot of other interesting data there too, like the weather and referee names.
Oh man, it's a good thing nfldb
doesn't extend to college football. I'd be clamoring for adding this info just to prove once and for all that Dick Witvoet 1) has it out for the Hawkeyes and 2) is a total Michigan homer.
@ochawkeye You're back! :heart:
Hmm. College football. I don't really follow it myself, but I wonder if there are any data sources... There certainly would be a lot more of it!
@BurntSushi and @ochawkeye, there are certainly data sources for college football. I've wanted to do this for some time now, and the most comprehensive I've seen are SI's json data feeds. But I've only just recently learned Python web scraping, and need to improve my skills. Maybe you guys can help out? :)
Edit: It appears SI has changed their entire website design, and now it's very wieldy for me to navigate the scoreboards. It also appears that they've completely redesigned their server-side code in how they serve the JSON files :(
Example: http://www.si.com/game/495336/box-score http://www.si.com/game/495336/play-by-play?json=1
This doesn't make it any easier to scrape college football data. Turns out that they've made this change across their entire site. I used to scrape college basketball data from JSON feeds that look like the below, but now the link is broken: data.sportsillustrated.cnn.com/jsonp/basketball/ncaa/men/gameflash/2014/03/23/93932_playbyplay.json
@BurntSushi Have you done any data verification between the json data and the Gamebook.xml data? I wasn't aware that those were available. Any idea if that contains the corrected and final stats for a game?
@gojonesy If only I knew how :P The plays look like this:
(1:14) 29-L.Blount right tackle to BUF 11 for no gain (96-S.Charles). FUMBLES (96-S.Charles), recovered by NE-88-M.Mulligan at BUF 15. 88-M.Mulligan to BUF 15 for no gain (50-K.Alonso).
The holy grail is being able to parse any play description like this and turning it into structured data (with sufficient granularity to be in nfldb
). I've been thinking about it for a long time and it is a hard problem. If we had such a tool, then I could add data going back to 1999.
@BurntSushi Yep :) Happened to have a kid right at the time you were fielding input on Python3 and conversation had died down by the time I returned...
I have a feeling @gojonesy is referring to to using this sort of data to highlight irregularities:
<RusherVisitor Player="F.Jackson" Attempts="9" Yards="47" Average="5.2" Touchdowns="0" Long="16" UniformNumber="22"/>
<ReceiverVisitor Player="F.Jackson" Number="1" Yards="13" Average="13.0" Touchdowns="0" Long="13" UniformNumber="22" PassTarget="1"/>
Aggregating those stats and comparing to aggregate nflgame
stats shouldn't be too big of a problem, though pinpointing the cause of the irregularity would require that "natural language" deconstruction of the play-by-play.
@BurntSushi @ochawkeye The Gamebook actually contains the aggregates for the game already:
<TeamStatistics Description="TOTAL FIRST DOWNS" VisitorStats="19" HomeStats="24"/>
<TeamStatistics Description="By Rushing" VisitorStats="8" HomeStats="12"/>
<TeamStatistics Description="By Passing" VisitorStats="10" HomeStats="7"/>
<TeamStatistics Description="By Penalty" VisitorStats="1" HomeStats="5"/>
<TeamStatistics Description="THIRD DOWN EFFICIENCY" VisitorStats="4-13-31%" HomeStats="4-13-31%"/>
<TeamStatistics Description="FOURTH DOWN EFFICIENCY" VisitorStats="0-3-0%" HomeStats="0-0-0%"/>
etc. That would make the job a little easier...I look into it a bit more after lunch.
@ochawkeye Congrats on the kid! :-) I'm very happy to see you return. :-)
@gojonesy Yeah, that's aggregated data. It's certainly useful and it would definitely be worth using it as a test. But as @ochawkeye said, it doesn't totally solve our problem. I plan on working on testing at some point this season, and those XML files will undoubtedly be part of it. (Unless it turns out that they are exactly derived from the JSON feed---or the reverse---in which case they'll be useless for testing.)
At the very least, it could help us pinpoint games (or categories) where statistics are off.
@BurntSushi Thanks for both responses.
I can think of some funny analyses that could be completed with the referee data.
I don't know much about Python or SQL, unfortunately. My use case requires a large amount of "box score" level data, so looping through like that wouldn't work very well.
I ended up just importing all the data into R using the RPostgreSQL
package. An example aggregation looked like this:
test <- aggregate(cbind(passing_yds,rushing_yds)~gsis_id+team,data=nfldb$play_player,FUN=sum,na.rm=TRUE)
@bayesrules Ah, I see. Caching aggregated stats for box score data is definitely a good approach.
It feels like NFLDB was written with fantasy football in mind. Nothing wrong with that, but I need to make some adjustments for my purposes.
Thanks again for putting all this together. I'm very optimistic about where my analysis could go now.
Well, fantasy football requires summing team statistics as well. nfldb
is designed for storing and retrieving NFL statistics. There really isn't anything specific to fantasy football included in nfldb
.
The API shouldn't do everything for you. If you need every drop of performance, I think it's reasonable to expect to write some SQL. On the other hand, I'm open to adding things to the API, if it's done elegantly.
I really didn't know where to put this, but I am going through the process of analyzing the Gamebooks and wanted to share the script I wrote to download them all...Since we discussed it here, I thought this may be a good place for it.
My first use for these will be to validate aggregated season totals that I am getting with nfldb. I'm mostly interested right now in analyzing things like Passing Efficiency, Rushing Efficiency, Turnover Differential, Penalty Yards, etc. I have found that those counts don't always mesh with some of the stats available at some of the popular statistic sites.
Hopefully, the Gamebooks will help to shed some light on why the totals are off at times.
import nfldb, nflgame, urllib, os
def get_team_gamebooks(year, s_type, team):
db = nfldb.connect()
q = nfldb.Query(db)
q.game(season_year=year, season_type=s_type, team=team)
game_list = []
for g in q.as_games():
game_list.append([g.gamekey, g.week])
return game_list
db = nfldb.connect()
q = nfldb.Query(db)
plays = 0
year = 2013
s_year = str(year)
s_type = 'Reg'
week = 0
for t in nflgame.teams:
path = "/your/path/Dev/nfldb/gamebooks/" + s_year
if not os.path.exists(path):
os.makedirs(path)
game_list = get_team_gamebooks(year, 'Regular', t[0])
for g in game_list:
# build the url
if len(str(g[1])) < 2:
week = "0" + str(g[1])
else:
week = str(g[1])
xmlurl = 'http://www.nflgsis.com/' + str(year) + "/" + s_type + "/" + week + "/" + str(g[0]) + "/Gamebook.xml"
xmlpath = path + "/" + g[0] + ".xml"
# Since we are iterating by team, gamekeys will appear twice (once for each team in the game).
# We only need to retrieve the file one time.
if not os.path.exists(xmlpath):
urllib.urlretrieve(xmlurl, xmlpath)
@gojonesy Bit of a crossover conversation, but this overlaps a bit with some of the work @BurntSushi did to validate season long totals a couple of years ago: nflgame test
Both before and after that testing, it was acknowledged that nflgame
data was not 100% accurate because the data being fed into nflgame
was not 100% accurate. Comparing the total season gamebook statistics to total season nflgame/nfldb statistics will likely tell us what we already know (assuming the source of those total season statistics is different); that would be step one in a many stepped process.
The ability exists to find 1. & 2. But without being able to pinpoint the exact play that causes the mismatch, there's not really a piece of data in the data base that we can update to help eliminate the mismatch.
Just a personal observation - and I am the farthest thing from a python function naming expert! - but your function name get_team_gamebooks
is a bit confusing since it doesn't actually get or return the team gamebooks ;)
Ha! I had more in the function to begin with and just never changed the name. Thanks for pointing that out!
Noting that stats are changed after games are played, is there a need/desire for data that is "final"? Would it be beneficial to anyone else to have this data available in nfldb as well?
@BurntSushi I know this issue i closed but seemed like an appropriate place to comment. First off, love the project, thanks for continuing to improve it. Looking forward to another Fantasy season. In regards to the aggregation of data, have you ever looked at using any tools for viewing the aggregated data in a friendly web interface? Last season I aggregated the data from nflgame and nfldb to elasticsearch for easy viewing and data analysis using http://www.elasticsearch.org/overview/kibana. Kibana was designed for log aggregation but I found it equally helpful for viewing, filtering, searching, and analyzing statistics. I'd be happy to share more once I get the 2014 season up and running with Kibana if this sort of thing is something you've been thinking about evolving to with the project.
@gojonesy In an ideal world, stats don't change after a game is over. As far as the JSON feed goes, this is probably true. (Some of the recent bundles of missing data were my fault, I think, and not a result of an initial incomplete JSON feed.) I am quite certain that official stat corrections are not applied to the JSON feed. Once again, the issue with stat corrections is that it is hard to get your hands on meaningful data. We need to know the exact play in which the correction is issued to fix it.
I am extremely reluctant to add a different source of aggregated data to nfldb
. I did this for nflgame
and it is my biggest regret, where you can retrieve aggregated game totals from the JSON feed or sum up stats from each play (Yes, they can be different! The JSON feed itself is not even internally consistent.) It adds significantly to API complexity. It is probably the biggest point of confusion with people using nflgame
.
I know it doesn't feel good, but until we have the ability to precisely parse human readable play descriptions, I don't think there is really much else that can be done as far as fixing the data goes.
@albertlyu Looks like the play-by-play data is still available: http://www.si.com/pbp/liveupdate?json=1&sport=football%2Fcfb&id=1300292&box=true&pbp=true&linescore=true --- It definitely looks like it has enough granularity to fit into nfldb
, but it would require some reverse engineering.
@lokivog
In regards to the aggregation of data, have you ever looked at using any tools for viewing the aggregated data in a friendly web interface?
A web interface is a big hammer. Have you seen my nflcmd project? Its specialty is viewing and sorting aggregated data. Here are some examples:
[andrew@Liger nflcmd] nflstats --weeks 1-4 --year 2013 tommy brady
Player matched: Tom Brady (NE, QB)
Week W/L Date OPP CMP P Att % P Yds Y/Att P TDs INT R Att R Yds Y/Att R TDs F Lost
---------------------------------------------------------------------------------------------------------
1 W Sep 8 @BUF 29 52 55.8 288 5.5 2 1 5 -4 -0.8 0 1
---------------------------------------------------------------------------------------------------------
2 W Sep 12 NYJ 19 39 48.7 185 4.7 1 0 2 -2 -1.0 0 0
---------------------------------------------------------------------------------------------------------
3 W Sep 22 TB 24 34 70.6 220 6.5 2 1 5 5 1.0 0 0
---------------------------------------------------------------------------------------------------------
4 W Sep 29 @ATL 20 31 64.5 316 10.2 2 0 5 -2 -0.4 0 0
---------------------------------------------------------------------------------------------------------
- - - - 92 156 59.0 1009 6.5 7 2 17 -3 -0.2 0 1
Or for aggregating by season:
[andrew@Liger nflcmd] nflstats --season t brady
Player matched: Tom Brady (NE, QB)
Year Team G CMP P Att % P Yds Y/Att Y/G 300+ P TDs INT R Att R Yds Y/Att R TDs F Lost
--------------------------------------------------------------------------------------------------------------
2009 NE 16 371 565 65.7 4398 7.8 274.9 7 28 13 29 44 1.5 1 2
--------------------------------------------------------------------------------------------------------------
2010 NE 16 324 493 65.7 3900 7.9 243.8 4 36 4 31 30 1.0 1 1
--------------------------------------------------------------------------------------------------------------
2011 NE 16 401 611 65.6 5239 8.6 327.4 11 39 12 43 109 2.5 3 2
--------------------------------------------------------------------------------------------------------------
2012 NE 16 400 637 62.8 4799 7.5 299.9 7 34 8 23 32 1.4 4 0
--------------------------------------------------------------------------------------------------------------
2013 NE 16 379 626 60.5 4338 6.9 271.1 6 25 11 33 18 0.5 0 3
--------------------------------------------------------------------------------------------------------------
- NE 80 1875 2932 63.9 22674 7.7 283.4 35 162 48 159 233 1.5 9 8
Notice the fuzzy player name matching. :-)
You can also rank statistics:
[andrew@Liger nflcmd] nflrank --years 2013 --limit 5 fumbles_lost
Player Team G F Lost
--------------------------------------
Peyton Manning DEN 16 6
--------------------------------------
Ryan Tannehill MIA 16 5
--------------------------------------
Russell Wilson SEA 16 5
--------------------------------------
Matt Flynn GB 7 5
--------------------------------------
Ben Roethlisberger PIT 16 5
With all of that said, I am working on a web interface. It will be for integrating fantasy football and broadcast footage mostly, but searching and filtering aggregated statistics will probably be part of it. You can monitor that progress here: https://github.com/BurntSushi/nflfan (It is not in any way, shape or form ready to be used yet.)
Of course, I think we would definitely love to here about your experience with ElasticSearch. Perhaps you could create a wiki page and tell us about it? :-)
Without dropping to the SQL layer, is there a way to obtain aggregate team statistics? For instance, I'd like to know the home team's rushing yards in a given game.
Additionally, I'd be curious to know the stadium a given game is played in. Ultimately, I'd like to derive a field which determines whether a given game is truly a home game for the home team.
Thanks in advance.