Describe the data model

BurntSushi commented 12 years ago

The data model for the GameCenter JSON data desperately needs to be described. The fact that there are three different ways to access player statistics (game level, play level and combined) will be baffling to new users. This may be somewhat addressable in the API, but this needs to be explained in detail.

The data model should describe the relationship between Game, Drive, Play, Player and {Game,Play}PlayerStats objects. It should also describe how statistics are computed in play-by-play data using the nflgame.statmap module.
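To make those relationships concrete, here is a minimal standalone sketch of how the objects might nest. It does not import nflgame; the class fields, the `plays()` helper and the player ids are hypothetical and only illustrate the shape (a game has drives, a drive has plays, and a play carries one stats record per participating player).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PlayPlayerStats:
    """One player's statistics within a single play (hypothetical shape)."""
    player_id: str
    stats: Dict[str, int] = field(default_factory=dict)


@dataclass
class Play:
    description: str
    players: List[PlayPlayerStats] = field(default_factory=list)


@dataclass
class Drive:
    team: str
    plays: List[Play] = field(default_factory=list)


@dataclass
class Game:
    home: str
    away: str
    drives: List[Drive] = field(default_factory=list)

    def plays(self) -> List[Play]:
        """Flatten every play of every drive, preserving order."""
        return [play for drive in self.drives for play in drive.plays]


# Made-up data: one drive with two plays.
game = Game(
    home="NE",
    away="BUF",
    drives=[
        Drive(team="NE", plays=[
            Play("pass complete for 12 yards", [
                PlayPlayerStats("qb-1", {"passing_att": 1, "passing_cmp": 1,
                                         "passing_yds": 12}),
                PlayPlayerStats("wr-7", {"receiving_rec": 1,
                                         "receiving_yds": 12}),
            ]),
            Play("rush for 3 yards", [
                PlayPlayerStats("rb-3", {"rushing_att": 1, "rushing_yds": 3}),
            ]),
        ]),
    ],
)

# Each play lists the players who took part in it.
participants = [[p.player_id for p in play.players] for play in game.plays()]
```

Game-level statistics would hang directly off the game rather than off individual plays, which is why the same player can be reached by more than one path.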

chrislkeller commented 11 years ago

Hi... I have to say the work you have done is amazing. I've only written a few scripts with the library, but am interested in getting some of the data loaded into a backend, and if I can help document things in any way I'd like to volunteer.

Is it safe to say that I can learn the data model through the csv output you've added? For instance, I could run this:

    import nflgame

    def season_stats_search():
        game = nflgame.one(2011, 17, "NE", "BUF")
        # Write game-level player statistics to a CSV file.
        game.players.csv('player-stats.csv')

To basically get the PlayerStats model:

    name,
    id,
    home,
    team,
    pos,
    defense_ast,
    defense_ffum,
    defense_int,
    defense_sk,
    defense_tkl,
    fumbles_lost,
    fumbles_rcv,
    fumbles_tot,
    fumbles_trcv,
    fumbles_yds,
    kicking_fga,
    kicking_fgm,
    kicking_fgyds,
    kicking_totpfg,
    kicking_xpa,
    kicking_xpb,
    kicking_xpmade,
    kicking_xpmissed,
    kicking_xptot,
    kickret_avg,
    kickret_lng,
    kickret_lngtd,
    kickret_ret,
    kickret_tds,
    passing_att,
    passing_cmp,
    passing_ints,
    passing_tds,
    passing_twopta,
    passing_twoptm,
    passing_yds,
    punting_avg,
    punting_i20,
    punting_lng,
    punting_pts,
    punting_yds,
    puntret_avg,
    puntret_lng,
    puntret_lngtd,
    puntret_ret,
    puntret_tds,
    receiving_lng,
    receiving_lngtd,
    receiving_rec,
    receiving_tds,
    receiving_twopta,
    receiving_twoptm,
    receiving_yds,
    rushing_att,
    rushing_lng,
    rushing_lngtd,
    rushing_tds,
    rushing_twopta,
    rushing_twoptm,
    rushing_yds
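As a rough illustration of what the CSV header above reveals, the sketch below groups column names by their category prefix. The `group_by_category` helper and the `identity` bucket name are made up for this example and are not part of nflgame; only a handful of the columns are included to keep it short.

```python
from collections import defaultdict

# A few of the column names from the CSV header above.
columns = [
    "name", "id", "home", "team", "pos",
    "defense_sk", "defense_tkl",
    "passing_att", "passing_cmp", "passing_yds",
    "rushing_att", "rushing_yds",
]


def group_by_category(names):
    """Map category prefix -> stat names; unprefixed names go under 'identity'."""
    groups = defaultdict(list)
    for name in names:
        prefix, sep, _ = name.partition("_")
        # Names without an underscore (name, id, ...) describe the player.
        groups[prefix if sep else "identity"].append(name)
    return dict(groups)


grouped = group_by_category(columns)
```

This shows which categories exist, but not how the values are computed or how game-level and play-level data relate.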

BurntSushi commented 11 years ago

@chrislkeller Thanks for your kind offer!

To answer your question, no, the CSV output is not a very good way of learning the data model. The CSV output is a good way of discovering what sorts of data are available and the kinds of values they contain. But there is an even better way: look at the statmap.py data dictionary. It contains each statistical category and a short description of each.
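For readers who have not opened statmap.py, the sketch below shows the general idea of a data dictionary that turns a raw statistic id into named fields. The ids, keys, descriptions and the `stat_values` helper are all hypothetical and are not copied from nflgame; treat it as an assumption about the shape, not a reference.

```python
# Hypothetical data dictionary: ids, keys and descriptions are illustrative
# and are NOT taken from nflgame's statmap.py.
STAT_MAP = {
    1: {"cat": "rushing", "fields": ["rushing_att"],
        "yds": "rushing_yds", "desc": "Rushing attempt with yardage"},
    2: {"cat": "passing", "fields": ["passing_att", "passing_cmp"],
        "yds": "passing_yds", "desc": "Completed pass with yardage"},
    3: {"cat": "passing", "fields": ["passing_att"],
        "yds": None, "desc": "Incomplete pass"},
}


def stat_values(stat_id, yards=0):
    """Translate one raw (stat id, yards) pair into named statistical fields."""
    entry = STAT_MAP[stat_id]
    # Every listed field counts once for this event.
    values = {name: 1 for name in entry["fields"]}
    # If the event carries yardage, store it under its own field.
    if entry["yds"]:
        values[entry["yds"]] = yards
    return values


completed_pass = stat_values(2, 12)
incomplete_pass = stat_values(3)
```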

I consider this a subset of the data model. The entire data model would include the following:

- A list of statistical fields and what each corresponds to. I have a machine generated CSV of this here, which was directly taken and edited from statmap.py. It is easier to read than statmap.py since it isn't written as a Python data structure.
- The relationship between game statistics and play statistics. When should one be used over the other?
- combine_max_stats mixes game and play statistics. How?
- What is the relationship between a player and a play? (Hint: There are many players for each play, and there are many plays for each player! "Player" is overloaded: there is a player that exists in the world of ideas as a football player, and there is a player that exists in a specific point in time participating in a single play.)
- How are cumulative statistics computed over a sequence of plays for each player?
- A description of other, less central types like GameClock, PossessionTime and FieldPosition.
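Two of the questions above (cumulative statistics over plays, and mixing game and play statistics) can be sketched without nflgame. The code below sums each player's per-play statistics, then merges the totals with game-level statistics by keeping the larger value of each field. The per-field maximum is an assumption about what combine_max_stats does and should be checked against the nflgame source; the function names, player ids and numbers are made up.

```python
from collections import Counter, defaultdict

# Made-up play-by-play data: each play is a list of (player id, stats) pairs.
plays = [
    [("qb-1", {"passing_att": 1, "passing_cmp": 1, "passing_yds": 12}),
     ("wr-7", {"receiving_rec": 1, "receiving_yds": 12})],
    [("qb-1", {"passing_att": 1})],
    [("qb-1", {"passing_att": 1, "passing_cmp": 1, "passing_yds": 8}),
     ("wr-7", {"receiving_rec": 1, "receiving_yds": 8})],
]

# Made-up game-level data, which may contain fields or players that the
# play-level data lacks.
game_level = {
    "qb-1": {"passing_att": 3, "passing_cmp": 2, "passing_yds": 20,
             "passing_tds": 0},
    "k-4": {"kicking_fgm": 1},
}


def accumulate(plays):
    """Sum each player's statistics over a sequence of plays."""
    totals = defaultdict(Counter)
    for play in plays:
        for player_id, stats in play:
            totals[player_id].update(stats)  # adds field by field
    return {pid: dict(counter) for pid, counter in totals.items()}


def combine_max(game_stats, play_stats):
    """Merge two stat tables, keeping the larger value of each field.

    ASSUMPTION: per-field max is a guess at the merge rule, not confirmed.
    """
    combined = {}
    for pid in set(game_stats) | set(play_stats):
        g = game_stats.get(pid, {})
        p = play_stats.get(pid, {})
        combined[pid] = {
            key: max(s[key] for s in (g, p) if key in s)
            for key in set(g) | set(p)
        }
    return combined


play_totals = accumulate(plays)
combined = combine_max(game_level, play_totals)
```

In this toy data the kicker appears only at game level and the receiver only at play level, so the combined table is the only one that contains all three players.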

I think some of the relationships between essential types will become clear when I create an ER diagram for nfldb's database schema.

I fully expect that describing the data model is something I'll have to do unless you become intimately familiar with the source code.

With that said, if you're looking for lower-hanging fruit to contribute, then doing something for issue #13 would be absolutely fantastic. A new wiki page would be very appropriate. It would be great to be able to link people to examples of using play-by-play data! (Note that there are actually examples littered throughout the issue tracker, but there's no coherent organization to them.)

BurntSushi commented 11 years ago

I just noticed that you said you were interested in getting the data "into a backend." I'm trying to accomplish something like that with nfldb, which uses PostgreSQL. My primary goals are to have a simpler API than nflgame and to make it fast. (nflgame will always be slow without a faster internal representation of data, which will probably never happen.) It should come with a program that auto-updates the database via nflgame.

I really really want to have it done before the season starts so that I can use the footage I collected with nflvid to provide the Ultimate Scouting Interface and have more fun with fantasy football drafts. :-)

ochawkeye commented 11 years ago

> ...so that I can use the footage I collected with nflvid to provide the Ultimate Scouting Interface

[image]

BurntSushi commented 11 years ago

Yeah, my reaction precisely. But it's going to be tough to get other people in on it. I have just about every single play from the 2011 and 2012 seasons (all-22 coach footage) on my hard drive. It's a little over 400GB. But I can't redistribute that for obvious legal reasons. (And even if it was legal, that would be the Torrent From Hell.) So other people will have to download their own copies, and it probably takes a few days for an entire season, depending on your connection.

chrislkeller commented 11 years ago

Wow you have a lot of awesome projects in the works... nflvid sounds just cool, though I'd assume one needs a subscription to access the video?

nfldb sounds exactly like what I was looking for as far as a backend.

I agree with your suggestion that working on the wiki is likely more up my alley and would allow me to learn much more about the API and what's available... Sounds like a fun project to work on while watching games this fall...

BurntSushi commented 11 years ago

> though I'd assume one needs a subscription to access the video?

@chrislkeller - Not at all. Using any capable video player (e.g., vlc), you can watch the HTTP Live Streams for free:

    vlc 'http://nlds82.cdnl3nl.neulion.com/nlds_vod/nfl/vod/2012/10/07/55577/2_55577_den_ne_2012_h_whole_1_4500.mp4.m3u8'

Getting the coach footage is a little more complicated since it's an rtmp stream and vlc doesn't seem to handle that as well. But that's what nflvid is there for. :-)

I then use XML data to slice up the footage into plays: http://e2.cdnl3.neulion.com/nfl/edl/nflgr/2012/55577.xml

BurntSushi commented 10 years ago

nfldb's data model is described in its wiki. I'm inclined to think that that is enough. The data is fundamentally the same, except it is more structured in nfldb.

BurntSushi / nflgame

Describe the data model #11