Query Questions - Githubissues

bayesrules commented 10 years ago

Without dropping to the SQL layer, is there a way to obtain aggregate team statistics? For instance, I'd like to know the home team's rushing yards in a given game.

Additionally, I'd be curious to know the stadium a given game is played in. Ultimately, I'd like to derive a field which determines whether a given game is truly a home game for the home team.

Thanks in advance.

BurntSushi commented 10 years ago

Great question! There isn't anything in the API that will directly give you aggregated team statistics, but that's because it's very easy to do. For example, rushing yards for the Patriots in week 17 last season:

import nfldb

db = nfldb.connect()
q = nfldb.Query(db)

q.game(season_year=2013, season_type='Regular', week=17, team='NE')
# An important line! Without it, you'll get `rushing_yds` from both teams.
q.play_player(team='NE')
pps = q.as_aggregate()
print sum(pp.rushing_yds for pp in pps)

Of course, this isn't as efficient as doing it in SQL.

If you have any ideas on how this might be added to the API proper, I might be open to them. (However, I do like the simplicity of the API now.)

Additionally, I'd be curious to know the stadium a given game is played in.

This isn't part of the source data unfortunately. However, I think it may exist in other places. For example: http://www.nflgsis.com/2013/Reg/17/56084/Gamebook.xml --- Look in that first GamebookSummary tag for the Stadium attribute. There's a lot of other interesting data there too, like the weather and referee names.

You can trivially construct those URLs using the gamekey attribute on a nfldb.Game object.

ochawkeye commented 10 years ago

There's a lot of other interesting data there too, like the weather and referee names.

Oh man, it's a good thing nfldb doesn't extend to college football. I'd be clamoring for adding this info just to prove once and for all that Dick Witvoet 1) has it out for the Hawkeyes and 2) is a total Michigan homer.

BurntSushi commented 10 years ago

@ochawkeye You're back! :heart:

Hmm. College football. I don't really follow it myself, but I wonder if there are any data sources... There certainly would be a lot more of it!

albertlyu commented 10 years ago

@BurntSushi and @ochawkeye, there are certainly data sources for college football. I've wanted to do this for some time now, and the most comprehensive I've seen are SI's json data feeds. But I've only just recently learned Python web scraping, and need to improve my skills. Maybe you guys can help out? :)

Edit: It appears SI has changed their entire website design, and now it's very wieldy for me to navigate the scoreboards. It also appears that they've completely redesigned their server-side code in how they serve the JSON files :(

Example: http://www.si.com/game/495336/box-score http://www.si.com/game/495336/play-by-play?json=1

This doesn't make it any easier to scrape college football data. Turns out that they've made this change across their entire site. I used to scrape college basketball data from JSON feeds that look like the below, but now the link is broken: data.sportsillustrated.cnn.com/jsonp/basketball/ncaa/men/gameflash/2014/03/23/93932_playbyplay.json

gojonesy commented 10 years ago

@BurntSushi Have you done any data verification between the json data and the Gamebook.xml data? I wasn't aware that those were available. Any idea if that contains the corrected and final stats for a game?

BurntSushi commented 10 years ago

@gojonesy If only I knew how :P The plays look like this:

(1:14) 29-L.Blount right tackle to BUF 11 for no gain (96-S.Charles). FUMBLES (96-S.Charles), recovered by NE-88-M.Mulligan at BUF 15. 88-M.Mulligan to BUF 15 for no gain (50-K.Alonso).

The holy grail is being able to parse any play description like this and turning it into structured data (with sufficient granularity to be in nfldb). I've been thinking about it for a long time and it is a hard problem. If we had such a tool, then I could add data going back to 1999.

ochawkeye commented 10 years ago

@BurntSushi Yep :) Happened to have a kid right at the time you were fielding input on Python3 and conversation had died down by the time I returned...

I have a feeling @gojonesy is referring to to using this sort of data to highlight irregularities:

<RusherVisitor Player="F.Jackson" Attempts="9" Yards="47" Average="5.2" Touchdowns="0" Long="16" UniformNumber="22"/>
<ReceiverVisitor Player="F.Jackson" Number="1" Yards="13" Average="13.0" Touchdowns="0" Long="13" UniformNumber="22" PassTarget="1"/>

Aggregating those stats and comparing to aggregate nflgame stats shouldn't be too big of a problem, though pinpointing the cause of the irregularity would require that "natural language" deconstruction of the play-by-play.

gojonesy commented 10 years ago

@BurntSushi @ochawkeye The Gamebook actually contains the aggregates for the game already:

<TeamStatistics Description="TOTAL FIRST DOWNS" VisitorStats="19" HomeStats="24"/>
<TeamStatistics Description="By Rushing" VisitorStats="8" HomeStats="12"/>
<TeamStatistics Description="By Passing" VisitorStats="10" HomeStats="7"/>
<TeamStatistics Description="By Penalty" VisitorStats="1" HomeStats="5"/>
<TeamStatistics Description="THIRD DOWN EFFICIENCY" VisitorStats="4-13-31%" HomeStats="4-13-31%"/>
<TeamStatistics Description="FOURTH DOWN EFFICIENCY" VisitorStats="0-3-0%" HomeStats="0-0-0%"/>

etc. That would make the job a little easier...I look into it a bit more after lunch.

BurntSushi commented 10 years ago

@ochawkeye Congrats on the kid! :-) I'm very happy to see you return. :-)

@gojonesy Yeah, that's aggregated data. It's certainly useful and it would definitely be worth using it as a test. But as @ochawkeye said, it doesn't totally solve our problem. I plan on working on testing at some point this season, and those XML files will undoubtedly be part of it. (Unless it turns out that they are exactly derived from the JSON feed---or the reverse---in which case they'll be useless for testing.)

At the very least, it could help us pinpoint games (or categories) where statistics are off.

bayesrules commented 10 years ago

@BurntSushi Thanks for both responses.

I can think of some funny analyses that could be completed with the referee data.

I don't know much about Python or SQL, unfortunately. My use case requires a large amount of "box score" level data, so looping through like that wouldn't work very well.

I ended up just importing all the data into R using the RPostgreSQL package. An example aggregation looked like this:

test <- aggregate(cbind(passing_yds,rushing_yds)~gsis_id+team,data=nfldb$play_player,FUN=sum,na.rm=TRUE)

BurntSushi commented 10 years ago

@bayesrules Ah, I see. Caching aggregated stats for box score data is definitely a good approach.

bayesrules commented 10 years ago

It feels like NFLDB was written with fantasy football in mind. Nothing wrong with that, but I need to make some adjustments for my purposes.

Thanks again for putting all this together. I'm very optimistic about where my analysis could go now.

BurntSushi commented 10 years ago

Well, fantasy football requires summing team statistics as well. nfldb is designed for storing and retrieving NFL statistics. There really isn't anything specific to fantasy football included in nfldb.

The API shouldn't do everything for you. If you need every drop of performance, I think it's reasonable to expect to write some SQL. On the other hand, I'm open to adding things to the API, if it's done elegantly.

gojonesy commented 10 years ago

I really didn't know where to put this, but I am going through the process of analyzing the Gamebooks and wanted to share the script I wrote to download them all...Since we discussed it here, I thought this may be a good place for it.

My first use for these will be to validate aggregated season totals that I am getting with nfldb. I'm mostly interested right now in analyzing things like Passing Efficiency, Rushing Efficiency, Turnover Differential, Penalty Yards, etc. I have found that those counts don't always mesh with some of the stats available at some of the popular statistic sites.

Hopefully, the Gamebooks will help to shed some light on why the totals are off at times.

import nfldb, nflgame, urllib, os

def get_team_gamebooks(year, s_type, team):
    db = nfldb.connect()
    q = nfldb.Query(db)

    q.game(season_year=year, season_type=s_type, team=team)
    game_list = []
    for g in q.as_games():
        game_list.append([g.gamekey, g.week])

    return game_list

db = nfldb.connect()
q = nfldb.Query(db)
plays = 0
year = 2013
s_year = str(year)
s_type = 'Reg'
week = 0
for t in nflgame.teams:
    path = "/your/path/Dev/nfldb/gamebooks/" + s_year

    if not os.path.exists(path):
        os.makedirs(path)

    game_list = get_team_gamebooks(year, 'Regular', t[0])
    for g in game_list:

        # build the url
        if len(str(g[1])) < 2:
            week = "0" + str(g[1])
        else:
            week = str(g[1])

        xmlurl = 'http://www.nflgsis.com/' + str(year) + "/" + s_type + "/" + week + "/" + str(g[0]) + "/Gamebook.xml"

        xmlpath = path + "/" + g[0] + ".xml"
        # Since we are iterating by team, gamekeys will appear twice (once for each team in the game).
        # We only need to retrieve the file one time.
        if not os.path.exists(xmlpath):
            urllib.urlretrieve(xmlurl, xmlpath)

ochawkeye commented 10 years ago

@gojonesy Bit of a crossover conversation, but this overlaps a bit with some of the work @BurntSushi did to validate season long totals a couple of years ago: nflgame test

Both before and after that testing, it was acknowledged that nflgame data was not 100% accurate because the data being fed into nflgame was not 100% accurate. Comparing the total season gamebook statistics to total season nflgame/nfldb statistics will likely tell us what we already know (assuming the source of those total season statistics is different); that would be step one in a many stepped process.

Highlight total season statistic mismatches
If season mismatch found, find individual game mismatch that contributes to that season mismatch
If game mismatch found, find individual play mismatch that contributes to that game mismatch

The ability exists to find 1. & 2. But without being able to pinpoint the exact play that causes the mismatch, there's not really a piece of data in the data base that we can update to help eliminate the mismatch.

Just a personal observation - and I am the farthest thing from a python function naming expert! - but your function name get_team_gamebooks is a bit confusing since it doesn't actually get or return the team gamebooks ;) Talk Amongst Yourselves

gojonesy commented 10 years ago

Ha! I had more in the function to begin with and just never changed the name. Thanks for pointing that out!

Noting that stats are changed after games are played, is there a need/desire for data that is "final"? Would it be beneficial to anyone else to have this data available in nfldb as well?

lokivog commented 10 years ago

@BurntSushi I know this issue i closed but seemed like an appropriate place to comment. First off, love the project, thanks for continuing to improve it. Looking forward to another Fantasy season. In regards to the aggregation of data, have you ever looked at using any tools for viewing the aggregated data in a friendly web interface? Last season I aggregated the data from nflgame and nfldb to elasticsearch for easy viewing and data analysis using http://www.elasticsearch.org/overview/kibana. Kibana was designed for log aggregation but I found it equally helpful for viewing, filtering, searching, and analyzing statistics. I'd be happy to share more once I get the 2014 season up and running with Kibana if this sort of thing is something you've been thinking about evolving to with the project.

BurntSushi commented 10 years ago

@gojonesy In an ideal world, stats don't change after a game is over. As far as the JSON feed goes, this is probably true. (Some of the recent bundles of missing data were my fault, I think, and not a result of an initial incomplete JSON feed.) I am quite certain that official stat corrections are not applied to the JSON feed. Once again, the issue with stat corrections is that it is hard to get your hands on meaningful data. We need to know the exact play in which the correction is issued to fix it.

I am extremely reluctant to add a different source of aggregated data to nfldb. I did this for nflgame and it is my biggest regret, where you can retrieve aggregated game totals from the JSON feed or sum up stats from each play (Yes, they can be different! The JSON feed itself is not even internally consistent.) It adds significantly to API complexity. It is probably the biggest point of confusion with people using nflgame.

I know it doesn't feel good, but until we have the ability to precisely parse human readable play descriptions, I don't think there is really much else that can be done as far as fixing the data goes.

BurntSushi commented 10 years ago

@albertlyu Looks like the play-by-play data is still available: http://www.si.com/pbp/liveupdate?json=1&sport=football%2Fcfb&id=1300292&box=true&pbp=true&linescore=true --- It definitely looks like it has enough granularity to fit into nfldb, but it would require some reverse engineering.

BurntSushi commented 10 years ago

@lokivog

In regards to the aggregation of data, have you ever looked at using any tools for viewing the aggregated data in a friendly web interface?

A web interface is a big hammer. Have you seen my nflcmd project? Its specialty is viewing and sorting aggregated data. Here are some examples:

[andrew@Liger nflcmd] nflstats --weeks 1-4 --year 2013 tommy brady
Player matched: Tom Brady (NE, QB)
  Week  W/L    Date   OPP  CMP  P Att     %  P Yds  Y/Att  P TDs  INT  R Att  R Yds  Y/Att  R TDs  F Lost
---------------------------------------------------------------------------------------------------------
     1    W   Sep 8  @BUF   29     52  55.8    288    5.5      2    1      5     -4   -0.8      0       1
---------------------------------------------------------------------------------------------------------
     2    W  Sep 12   NYJ   19     39  48.7    185    4.7      1    0      2     -2   -1.0      0       0
---------------------------------------------------------------------------------------------------------
     3    W  Sep 22    TB   24     34  70.6    220    6.5      2    1      5      5    1.0      0       0
---------------------------------------------------------------------------------------------------------
     4    W  Sep 29  @ATL   20     31  64.5    316   10.2      2    0      5     -2   -0.4      0       0
---------------------------------------------------------------------------------------------------------
     -    -       -     -   92    156  59.0   1009    6.5      7    2     17     -3   -0.2      0       1

Or for aggregating by season:

[andrew@Liger nflcmd] nflstats --season t brady      
Player matched: Tom Brady (NE, QB)
  Year  Team   G   CMP  P Att     %  P Yds  Y/Att    Y/G  300+  P TDs  INT  R Att  R Yds  Y/Att  R TDs  F Lost
--------------------------------------------------------------------------------------------------------------
  2009    NE  16   371    565  65.7   4398    7.8  274.9     7     28   13     29     44    1.5      1       2
--------------------------------------------------------------------------------------------------------------
  2010    NE  16   324    493  65.7   3900    7.9  243.8     4     36    4     31     30    1.0      1       1
--------------------------------------------------------------------------------------------------------------
  2011    NE  16   401    611  65.6   5239    8.6  327.4    11     39   12     43    109    2.5      3       2
--------------------------------------------------------------------------------------------------------------
  2012    NE  16   400    637  62.8   4799    7.5  299.9     7     34    8     23     32    1.4      4       0
--------------------------------------------------------------------------------------------------------------
  2013    NE  16   379    626  60.5   4338    6.9  271.1     6     25   11     33     18    0.5      0       3
--------------------------------------------------------------------------------------------------------------
     -    NE  80  1875   2932  63.9  22674    7.7  283.4    35    162   48    159    233    1.5      9       8

Notice the fuzzy player name matching. :-)

You can also rank statistics:

[andrew@Liger nflcmd] nflrank --years 2013 --limit 5 fumbles_lost 
              Player  Team   G  F Lost
--------------------------------------
      Peyton Manning   DEN  16       6
--------------------------------------
      Ryan Tannehill   MIA  16       5
--------------------------------------
      Russell Wilson   SEA  16       5
--------------------------------------
          Matt Flynn    GB   7       5
--------------------------------------
  Ben Roethlisberger   PIT  16       5

With all of that said, I am working on a web interface. It will be for integrating fantasy football and broadcast footage mostly, but searching and filtering aggregated statistics will probably be part of it. You can monitor that progress here: https://github.com/BurntSushi/nflfan (It is not in any way, shape or form ready to be used yet.)

BurntSushi commented 10 years ago

Of course, I think we would definitely love to here about your experience with ElasticSearch. Perhaps you could create a wiki page and tell us about it? :-)

BurntSushi / nfldb

Query Questions #30