BurntSushi / nfldb

A library to manage and update NFL data in a relational database.
The Unlicense

Old question, new scheme. #1

Closed: teamfball closed this issue 10 years ago

teamfball commented 10 years ago

""How do I programmatically resolve where a player will be before the game takes place? Additionally, a player’s injury status pre game is another hurdle.""

We discussed this topic at length five or six months ago. Any chance the DB will afford a solution?

BurntSushi commented 10 years ago

TL;DR - Unlikely to be resolved completely. Sorry.

The essential problem is name mapping. In GameCenter's JSON feed, for example, Chris Johnson's name is C.Johnson. Guess what Calvin Johnson's name is? C.Johnson. This by itself isn't completely damning, since we might be able to use other information to match players. But it becomes incredibly tricky. A player in GameCenter JSON isn't the same as a player on a roster. Namely, a player in GameCenter JSON corresponds to immutable statistics about a particular person at a single point in time. But a player on a roster is ephemeral and mutable. It can change at any time. So there's just no reliable way of mapping players in GameCenter JSON to the ephemeral players found on a roster unless the NFL provides a mapping.
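The collision is easy to reproduce. The sketch below is not nflgame code: the `abbreviate` helper and the sample list are assumptions, chosen only to show why an abbreviated name alone cannot serve as a key.

```python
from collections import defaultdict


def abbreviate(full_name):
    """Hypothetical stand-in for the 'F.Last' style seen in the feed."""
    first, last = full_name.split(' ', 1)
    return '%s.%s' % (first[0], last)


# Sample names; the first two come from the example above.
full_names = ['Chris Johnson', 'Calvin Johnson', 'Tom Brady']

# Group full names by their abbreviated form.
by_short = defaultdict(list)
for name in full_names:
    by_short[abbreviate(name)].append(name)

# Keep only the abbreviations shared by more than one player.
ambiguous = dict((k, v) for k, v in by_short.items() if len(v) > 1)
print(ambiguous)
```

Any abbreviation that lands in `ambiguous` needs extra information, such as a unique identifier, before it can be tied to one person.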

Fortunately, there is a way to get that mapping. The code to do it is in scripts/download-player-data. The problem is that it needs to ping NFL.com 32 times to get roster data, and then ping NFL.com for each player in the NFL. In doing this, we can discover a mapping from C.Johnson to Calvin Johnson by way of a unique identifier.

The number of requests required to establish this mapping means that it isn't a good idea to rely on this for "real time" data. You could certainly run it before a batch of games every week, and that might be good enough. In particular, if you have a player stats object, then you can access that player's information with the player attribute. Here's a brief code sample:

import nflgame

game = nflgame.one(2012, 1, 'NE', 'NE')
tfb = list(game.drives.plays().players().filter(name='T.Brady'))[0]
print tfb.formatted_stats()
print tfb.player.name, tfb.player.college

That tfb.player object is populated directly from nflgame/players.json. If players.json is up to date, then it will tell you which team a player is on. So if you wanted to print the name, team and position of every player in the NFL, you could do:

for p in nflgame.players.itervalues():
    print p.name, p.team, p.position

And you can get statistics this way for a particular player:

tfb = nflgame.find('Tom Brady')[0]
print tfb.name, tfb.team, tfb.position

stats = list(tfb.plays(2012, 1).players().filter(playerid=tfb.playerid))[0]

Those stats should be the same as the ones above. So this allows you to complete the circle: full player name to GameCenter statistics and back. The problem is that keeping that mapping up to date is expensive. If you wanted to keep it up to date yourself, then running

download-player-data > players.json

and moving that players.json into the nflgame package directory will square you away.

With regard to injury status, I think that is outside the scope of nflgame. I'm unlikely to add it. I want to keep the number of external sources to a minimum. In particular, getting injury updates suffers from precisely the problem I've just outlined. The mapping is difficult to get.

I do have one idea though. It might be possible to make download-player-data a bit cheaper than it is. Namely, I could provide an --update switch that only tries to fetch player information that differs essentially from the data we already have (like a switched team or position). However, it would still require a minimum of 32 requests (one for each team), so it couldn't be updated in real time. But I'd feel better about, say, running it every day. But you'd have to set up a cron job (or whatever the Windows equivalent is) to run it.
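A rough sketch of that idea, assuming each roster page yields some per-player key plus team and position. The `profile_id` keys, the tuple layout and the `needs_refresh` name are all hypothetical; none of this is the actual script.

```python
def needs_refresh(known, roster):
    """Return the keys whose (team, position) is new or has changed.

    known:  dict of profile_id -> (team, position) from the existing data
    roster: dict of the same shape, built from freshly fetched roster pages
    """
    stale = []
    for profile_id, current in sorted(roster.items()):
        if known.get(profile_id) != current:
            stale.append(profile_id)
    return stale


# Made-up data: p2 switched teams and p3 is new.
known = {'p1': ('NE', 'QB'), 'p2': ('IND', 'WR')}
roster = {'p1': ('NE', 'QB'), 'p2': ('SF', 'WR'), 'p3': ('BAL', 'QB')}

stale = needs_refresh(known, roster)

# 32 roster requests are always needed; only stale players cost extra.
total_requests = 32 + len(stale)
print(stale)
print(total_requests)
```

Only the players in `stale` would need their individual pages loaded, which is where the savings would come from.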

So I'll leave this issue open for now until I get around to looking into that.

teamfball commented 10 years ago

I use the player attribute to get --Joe Flacco (QB, BAL)--from which I extract Joe Flacco instead of J.Flacco. However, looking at Thursday’s results, --Austin Collie-- just signed with the 49ers over the weekend, and a few others are missing. Additionally the final stats from last season omit 191 player references. So my first challenge is getting current players.json data. But pinging the NFL 1696 plus times is a risk I’m not willing to take on a weekly basis just yet.
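The 1696 figure matches 32 teams times 53 players. Reading 53 as the roster size is an assumption, since the thread does not say where the number comes from. Under that assumption the request count works out as follows:

```python
TEAMS = 32          # one roster page per team
ROSTER_SIZE = 53    # assumption: players per roster

roster_requests = TEAMS
player_requests = TEAMS * ROSTER_SIZE
total_requests = roster_requests + player_requests

print(player_requests)
print(total_requests)
```

The "plus" in "1696 plus" would then be the 32 roster pages on top of the per-player pages.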

Around draft time we chatted about this data source,
http://www.nfl.com/players/search?category=lastName&filter=B&playerType=current
which includes a complete name, current status, position and appears to be updated frequently. Assuming it also contains a player ID for mapping, this might work for "real time" or daily data; it involves 44 pages of data. I think at one point you asked if I knew the json key or code for this, which I do not. But I’m willing to look if you could explain what it is exactly I’m looking for. Didn’t you find the other necessary json feeds by accident?

BurntSushi commented 10 years ago

However, looking at Thursday’s results, --Austin Collie-- just signed with the 49ers over the weekend, and a few others are missing. Additionally the final stats from last season omit 191 player references.

This is all unfortunately expected behavior. It's an artifact of linking ephemeral players with players associated with a particular point in time. It's just never going to correspond completely. But yeah, I haven't updated the players.json file in at least a week, so it won't be completely up to date. But you can update it yourself.

That 191 number sounds about right too. It just means that there were 191 players in the stat books last year that aren't on a current roster this year. Note that a player can be on the stat books even if they don't play a snap (I think). They could be on the roster for a single game.
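In other words, the gap is a set difference between the identifiers in the stat books and the identifiers on current rosters. A toy sketch: only the first ID is real (it is the Tom Brady ID quoted elsewhere in this thread) and the others are made up for illustration.

```python
# IDs seen in last season's statistics (mostly made-up values).
stat_book_ids = set(['00-0019596', '00-0000001', '00-0000002'])

# IDs found on a current roster.
roster_ids = set(['00-0019596'])

# Players with statistics but no current roster entry.
missing = sorted(stat_book_ids - roster_ids)
print(len(missing))
```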

But pinging the NFL 1696 plus times is a risk I’m not willing to take on a weekly basis just yet.

I know it sounds like a lot of requests, but it's probably safe to do it weekly. Up to you. My modifications that I mentioned should bring the number down drastically. (I'm 95% sure it's possible.)

which includes a complete name, current status, position and appears to be updated frequently. Assuming it also contains a player ID for mapping, this might work for "real time" or daily data; it involves 44 pages of data. I think at one point you asked if I knew the json key or code for this, which I do not. But I’m willing to look if you could explain what it is exactly I’m looking for.

I did look into using those pages, but it's the same setup as the roster pages I'm using now. I'll show you. Pick a name on that page and click on it. Now right click somewhere on the web page and click "View Page Source". You should now see the HTML for that page. Do a search for the string GSIS ID. In Tom Brady's case, I see:

<!-- 
Player Info
ESB ID: BRA371156
GSIS ID: 00-0019596
-->

That last GSIS ID is what I need. So even with that list of players, I need to load each player page, which is unfortunately what I'm doing already. The GSIS ID is the "json key or code" that I need. :-)
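Pulling the identifier out of a page is the easy part. Here is a minimal sketch with a regular expression, assuming every ID has the same two-digits, dash, seven-digits shape as the one above. That shape is a guess from a single example, and `extract_gsis_id` is a made-up name, not part of nflgame.

```python
import re

# Assumed shape: two digits, a dash, seven digits (e.g. 00-0019596).
GSIS_RE = re.compile(r'GSIS ID:\s*(\d{2}-\d{7})')


def extract_gsis_id(html):
    """Return the GSIS ID found in a player page, or None if absent."""
    match = GSIS_RE.search(html)
    return match.group(1) if match else None


# The comment block quoted above, used as sample input.
sample = """<!--
Player Info
ESB ID: BRA371156
GSIS ID: 00-0019596
-->"""

print(extract_gsis_id(sample))
```

The cost is in fetching one page per player to feed into a function like this, not in the parsing itself.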

Didn’t you find the other necessary json feeds by accident?

Kinda sorta. I've done plenty of web development in my day, so I looked hoping to find something like that. I still think of it as a gold mine. :-) It is quite literally the only thing of its kind on the web (that is publicly available) that I know of.

I can assure you that the player rosters on the pages we've been looking at are not being generated from JSON data. There are other roster lists on the web that I've found (as in, a single request to get every NFL player), but they aren't from NFL.com and therefore don't contain the identifiers I need.

BurntSushi commented 10 years ago

P.S. I apologize if I've repeated junk that I've said in the past that you already know. My memory is bad, and since this is an issue tracker, I think it's crucially important to explain as much as I know about the problem at hand. So when someone else asks me about this 3 months from now, I won't have a clue how to answer, but I'll know to point them to this issue. :-)

(I'm also optimistic. If I pose a well defined problem, then it will make it easier for a sufficiently clever person to solve the problem for me!)

BurntSushi commented 10 years ago

@teamfball I am a giant idiot. I've only now realized that you asked about nfldb and not nflgame. Good golly.

My answer unfortunately doesn't change much. However, I do believe the situation will get slightly better. Namely, player meta data will be inserted and updated as it comes in. So that the list of player meta data won't just be a snapshot of the current rosters, but will contain players who were once on a roster (and logged via nflgame) but are no longer on one. It really isn't that much better, because it's still limited to whenever you start populating the database.
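To illustrate the "inserted and updated as it comes in" behavior, here is a small sketch using Python's built-in sqlite3 module as a stand-in. nfldb itself targets PostgreSQL, and the table, columns and `upsert_players` function below are hypothetical, not nfldb's schema or API.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE player ('
    'player_id TEXT PRIMARY KEY, full_name TEXT, team TEXT)')


def upsert_players(conn, roster):
    """Update known players and insert new ones; never delete anyone."""
    for player_id, full_name, team in roster:
        cur = conn.execute(
            'UPDATE player SET full_name = ?, team = ? WHERE player_id = ?',
            (full_name, team, player_id))
        if cur.rowcount == 0:
            conn.execute(
                'INSERT INTO player VALUES (?, ?, ?)',
                (player_id, full_name, team))
    conn.commit()


# First update: two players are on rosters (the second ID is made up).
upsert_players(conn, [('00-0019596', 'Tom Brady', 'NE'),
                      ('00-0000001', 'Example Player', 'IND')])
# Later update: the second player is no longer on any roster.
upsert_players(conn, [('00-0019596', 'Tom Brady', 'NE')])

rows = conn.execute(
    'SELECT player_id, team FROM player ORDER BY player_id').fetchall()
print(rows)
```

Because nothing is deleted, a player who drops off every roster still has a row, but only if the database was already being populated while that player was on one.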

The main purpose of nfldb is to provide a relational model of the data inside nflgame. Among other things, it should make searching over an entire season near instantaneous (as opposed to a few seconds on my system).

teamfball commented 10 years ago

Idiot

I think not. I clearly muddled the topic by thinking nflgame myself. Perhaps the issue belongs within nflgame, sorry about that.

If you wanted to keep it up to date yourself, then running download-player-data > players.json and moving that players.json into the nflgame package directory will square you away.

Shamefully, I have no clue how to do this. I see players.json, I see and run player.py but nothing changes. I have successfully made my own modules to create my custom playerdata.csv files, but the download-player-data eludes me. Attempts to update the schedule were unsuccessful as well.

I could provide an --update switch that only tries to fetch player information that differs essentially from the data we already have (like a switched team or position) ... But you'd have to set up a cron job (or whatever the Windows equivalent is) to run it.

I will eagerly wait for your possible update. By ‘cron’ job, I trust you mean to simply download that info prior to running my playerdata module.

With regard to injury status, I think that is outside the scope of nflgame. I'm unlikely to add it. I want to keep the number of external sources to a minimum. In particular, getting injury updates suffers from precisely the problem I've just outlined. The mapping is difficult to get.

Have you looked at the team pages? Each team has its own injury listing. Surely the mapping will be on those pages. If not, the names should be in the exact same format. Coupling that with the aforementioned update procedure should allow me to scrape the injury status reports.

The main purpose of nfldb is to provide a relational model of the data inside nflgame

I’m excited to see this as well, because several months ago I experimented with creating my own MySQL db using this data. Will the db be local for each user? And what db are you using?

BurntSushi commented 10 years ago

Shamefully, I have no clue how to do this. I see players.json, I see and run player.py but nothing changes. I have successfully made my own modules to create my custom playerdata.csv files, but the download-player-data eludes me. Attempts to update the schedule were unsuccessful as well.

It's a bit tricky. The problem is that scripts/download-player-data is not included in the release, but instead is only in the git repository. I'll try to find a way to shove it into the release when I make those updates. I'll expose something in the API that will let you set which players.json file to use, so you don't have to worry about putting it in the right spot.

I will eagerly wait for your possible update. By ‘cron’ job, I trust you mean to simply download that info prior to running my playerdata module.

A cron job on Linux is something which runs at a defined frequency. For example, I have a program that runs every day at 5:10AM that downloads the IMDB database. :-) I know Windows has something equivalent, but I don't know what it's called or how to use it.

Have you looked at the team pages? Each team has its own injury listing. Surely the mapping will be on those pages. If not, the names should be in the exact same format. Coupling that with the aforementioned update procedure should allow me to scrape the injury status reports.

Yeah, I just looked at it. The injuries should actually only be an additional 32 requests through some trickery.

I will give some thought to injuries, but my initial feeling is that it's still out of the scope of nflgame and nfldb. In the worst case, I can write something for you that does the scraping with only 32 requests.

I’m excited to see this as well, because several months ago I experimented with creating my own MySQL db using this data. Will the db be local for each user? And what db are you using?

Yes, it will have to be local. I will never be in the business of a centralized service (e.g., with the database on my public server) because of the costs involved and the very real risk of getting a Cease and Desist letter from someone who thinks they own the data. I would have no choice but to comply, and that would just suck. The beauty of nflgame, and indeed nfldb too, is that it's decentralized. The only way to stop it is for the NFL to cut off their JSON feed. Fortunately, I think other web sites like ESPN use that same feed for live updates. (Weird, eh?)

And it will be using PostgreSQL. All of the cool kids have moved away from MySQL. :-)

BurntSushi commented 10 years ago

OK. This issue was filed before nfldb development ever really began.

I believe the new nflgame-update-players script resolves this issue for the most part. And since that data is included in nfldb, I'm marking this issue resolved.

Note that the following wiki articles may be relevant: