projections database add-on

hsharrison commented 9 years ago

Hey, thanks for this awesome library. I've been thinking of starting a new project, something similar to this but for keeping track of experts' projections.

I just thought I'd ask for advice, as I'd want it to integrate with nfldb--at least key off of game ID and player ID. But how tight is the question. Are things going to screw up if I had more tables to the database? And how hard would it be to leverage the query system as well? Appreciate any thoughts you might have.

As for how to make it available, I think the best option is a separate package with a dependency of nfldb. But if you think it would make more sense as a pull request, or even a fork, let me know.

BurntSushi commented 9 years ago

I would definitely say to do it as a separate package that depends on nfldb. I think adding projections is a great idea, but I'm unsure of the extra maintenance cost (for myself).

Are things going to screw up if I had more tables to the database?

Nope. As long as there are no name collisions, you should be OK. If you modify existing tables... that might be harder.

... however, one of the nice properties of nfldb is that it can rebuild itself and it will do migrations. Ideally, your extension would do the same, but if it's a separate package, you'd have to do your own migrations. And you'd probably want to write a wrapper script around nfldb-update. Overall, it shouldn't be too difficult, but I don't know what your experience level is. :-)

And how hard would it be to leverage the query system as well?

The query system is by far the most complex piece of nfldb because it does query generation, including joins. If your additions are simple, then it should be pretty easy to add it to the existing generator. (It even supports concatenating tables, so if your table is like an "extension" of an existing table---i.e., a one-to-one correspondence---then you should be good to go.)

... but I wrote all that from the perspective of modifying nfldb rather than writing a separate library. I think it is still possible, but we'd have to collaborate and I'd probably have to make some minor changes inside nfldb in order for it to be reasonable for you to do it.

We can collaborate however you like, but if you want to chat synchronously, you can visit me on IRC/FreeNode on #nflgame. Just mention my nick burntsushi.

hsharrison commented 9 years ago

Thanks for the prompt response. That's all about what I expected (/hoped) to hear.

I have a good amount of experience using databases, mostly interactively, but this will be my first time creating a database application.

I think the best way to start is to not worry about the query system at first. I'll write some code to set up new tables and start collecting scraping scripts to populate them. Then when that seems to be working we can consider if it would be worthwhile to tie it into the query system. So until then, I hopefully won't need much help from you.

One question, though, what do you mean about a wrapper around nfldb-update? Are you imagining it would also update the projections, or is there a technical reason?

Thanks again for your offer to help, hopefully this doesn't turn out to be too much harder than I'm anticipating...

BurntSushi commented 9 years ago

One question, though, what do you mean about a wrapper around nfldb-update? Are you imagining it would also update the projections, or is there a technical reason?

Yeah, it could update the projections. It's mostly as a convenience and you can do it as a separate transaction.

I think the best way to start is to not worry about the query system at first.

Definitely. It might be worth checking out nfldb/types.py to get a feel for how the objects are represented in memory for each table. You don't have to follow that pattern, but it would be easier for future integration. But yeah, start simple!

albertlyu commented 9 years ago

Out of curiosity, @hsharrison, how would you map game IDs and player IDs for player-game projections from external sources outside of NFL.com? Are you considering some sort of fuzzy string matching, or some sort of association table / cross-reference table to maintain your ID mappings? Or are those questions TBD at a later date?

hsharrison commented 9 years ago

Well, what's prompting this is that if I've been doing a lot of it already with csv files. I've been maintaining a table of mis-matching names by hand, mapping alternative spellings onto player IDs. I'll probably want to work on a more robust solution eventually, but it's what I'll start with at least. So far it has 54 entries after processing data from a few different sources.

BurntSushi commented 9 years ago

@hsharrison Do you know about the fuzzy player name searching in nfldb? That might help.

hsharrison commented 9 years ago

Yeah, I saw that. When I started I wasn't sure if it would be robust and I had much smaller dataset so only had a few entries in the association table. Now that I have a longer list it would be a good opportunity to test its robustness.

I was a little disappointed that you could only filter by current team, necessitating leaving the comparisons open to the entire database, but it might still be robust enough. I'll give it a shot.

BurntSushi commented 9 years ago

I was a little disappointed that you could only filter by current team

Yeah, it'd be nice to open that up a bit... It might be possible, but it would require accessing the play_player table for a history of teams that a player has recorded a statistic with.

rkmaddox commented 9 years ago

+1 for making past teams / rosters available. Not sure what it would look like, but having that functionality in some form would add a lot of flexibility, I think.

BurntSushi commented 9 years ago

@rkmaddox It's only possible to the extent that a player has recorded a statistic. So, e.g., offensive linemen are usually missing.

hsharrison commented 9 years ago

Well, I've started a repo: https://github.com/hsharrison/nfldb-projections

No code yet, just a README stub. The only thing of substance is a description of the new tables I'm thinking of adding. Curious if anyone has any comments. It does seem overly complex but I can't think of a better way to introduce the concept of fantasy points into the system.

At this point, we can take any discussion to the new repo, and ping back here with updates and/or questions about nfldb internals.

rkmaddox commented 9 years ago

Ah right, hadn't really thought about that quirk of the data source.

hsharrison commented 9 years ago

@BurntSushi to keep track of the add-on version, would you recommend adding a value to meta or adding a new meta table?

BurntSushi commented 9 years ago

@hsharrison Unless you have a really good reason for doing so, I would very much try to stay away from modifying existing tables.

hsharrison commented 9 years ago

Makes sense, thanks.

hsharrison commented 9 years ago

The ERD is a big spaghetti mess but I think I've got all the major issues sorted bar one: handling games that aren't in nfldb yet. I'm at a bit of a loss here. The best plan I can think of is to use some placeholder value in gsis_id and have a separate script that scans for new games and updates the ids accordingly. It could be incorporated into an nfldb-update wrapper for convenience.

What do you think? Is there better option that I'm missing? This feels like an ugly hack.

BurntSushi commented 9 years ago

Could you outline the use case here? Can it just not add info for games that don't exist in the database? Generally speaking, games can be in nfldb as soon as the schedule for them is known. For example, when the season starts, an entry for every game in the season gets added as soon as the schedule is released. Similarly for playoffs---you just have to update the nflgame schedule.

hsharrison commented 9 years ago

Well, that solves the problem then!

I just saw the comment in nfldb.update about only inserting games into the database 15 minutes before kickoff (if I'm interpreting that correctly) so I guess the solution is to grab the IDs from nflgame instead of nfldb. When I didn't see the IDs in my database I just figured they weren't yet available. Thanks for the info.

BurntSushi commented 9 years ago

Are you looking at the games_schedule function? Note the docstring. :-)

Returns a list of GSIS identifiers corresponding to games that have schedule data in the database but don't have any drives or plays in the database. In the typical case, this corresponds to games that haven't started yet.

The list is sorted in the order in which the games will be played.

Unfortunately, the update script is rather complex. It has to be smart about what it updates, or else it will be really slow. e.g., It will bulk insert schedule data (which are just rows in the game table). But it has to avoid pinging NFL.com for drive/play data for every game, otherwise it would be pretty ridiculous. So it checks the start date/time of the game to see whether it should query for additional data.

And once a game is over, it never changes.

An interesting property of this approach is that it is idempotent. If you run nfldb-update on a completely updated database, it should make no changes. If you delete any row from the game table and re-run nfldb-update, that row and all of its drive/play data will get re-inserted automatically. (This is especially useful on the rare occasion that game data gets corrupted.)

nfldb-update is pretty smart about what it does, but it pays for that in complexity.

hsharrison commented 9 years ago

I was looking at this comment just a few lines below that docstring. In short, games more than 15 minutes in the future are skipped. But now I see that's just for purposes of checking for more info rather than not inserting the game at all.

So I'm not sure why I didn't have the wild-card games in my database until I updated again just now. I thought I had updated yesterday or the day before, but I guess not.

Well, it looks like this was a false alarm. I'll just make sure nfldb is updated before inserting projections, and it should be able to find all the game IDs it needs. Idempotency will come in handy, then. :-)

BurntSushi commented 9 years ago

So I'm not sure why I didn't have the wild-card games in my database until I updated again just now. I thought I had updated yesterday or the day before, but I guess not.

That's because I just updated nflgame. You can see this comment here: https://github.com/BurntSushi/nflgame/issues/106#issuecomment-68490377 --- that command will let you update the nflgame schedules without having to wait for me to do it (after which, you'll need to do a pip install --upgrade nflgame). Once you update the schedules in nflgame, the next run of nfldb-update will pick them up.

hsharrison commented 9 years ago

Ooooh, everything makes sense now. Awesome.

By the way, I've spent the last couple days looking through your code extensively (in order to mimic your approach wherever possible), and I have to say it's been way easier than I expected to figure out what's going on. So thanks for all the hard work you put into this, the final product is a joy to work with.

BurntSushi commented 9 years ago

No problem. :-) But you may not be saying that once you get to SQL generation. :-) This one is pretty crazy: https://github.com/BurntSushi/nfldb/blob/master/nfldb/sql.py#L177

hsharrison commented 9 years ago

Yeah, I almost added a line "but I haven't got the query system yet". :-)

hsharrison commented 9 years ago

Still making progress...

Any idea how I could handle DST "players"? The only option I have seems to be allowing the player_id column of the various projection tables to match with either player.player_id or team.team_id. Does Postgres allow this kind of dual foreign-key constraint? It sounds like something that will create headaches.

Alternatively, would you be willing to hardcode these rows into the player table? Or do you have any other ideas?

Edit: would an intermediary table that is basically a concatenation of team and player be a feasible option? I'm running into the limits of my DB knowledge here.

BurntSushi commented 9 years ago

I definitely do not want to be adding defensive teams as players in the player table. That seems very wrong.

In all the code I've written, I've handled "defense players" as a special case unfortunately. I haven't yet discovered a good abstraction over regular players and defensive team "players."

The other option is to have two projection tables. One for regular players and one for defensive teams. I'd probably try that.

hsharrison commented 9 years ago

I definitely do not want to be adding defensive teams as players in the player table. That seems very wrong.

Fair enough.

The other option is to have two projection tables. One for regular players and one for defensive teams. I'd probably try that.

Just curious, what's the rationale? A pro seems to be that I could then split the offensive statistics from the defensive categories. The con is that queries become a lot more complicated if one is interested in fantasy points across all "positions".

Right now I'm leaning toward the intermediary table with a trigger to keep it updated when players are added. I think it will keep things easier, but I'm curious what your reason is to prefer splitting them.

hsharrison commented 9 years ago

Hmm, I got a second opinion in favor of the option of splitting them, so I'm willing to accept my intuition here is wrong.

BurntSushi commented 9 years ago

I absolutely think you should go with your intuition. I'm not happy with any solution I've ever come up with---I'm just tossing out ideas. :-)

Just curious, what's the rationale? A pro seems to be that I could then split the offensive statistics from the defensive categories. The con is that queries become a lot more complicated if one is interested in fantasy points across all "positions".

That second point is precisely the issue. If you split them, then running one query to grab everything is going to be difficult (so I agree with you there). It may be much easier to say, "If you want projections for individual players, then write this kind of SQL. And if you want projections for defensive teams, then write this kind of SQL." If you end up developing a library or a UI around this stuff, that's where you might be able to merge them under one common interface.

Your idea of using one id column to store either a player id or a team id is interesting. You're right to point out a foreign key problem. Maybe you just don't use a foreign key for that column?

Another option is for each of your projection tables to have two columns: player_id and team_id with the invariant that exactly one of them must always be NULL.

I really don't see a clear answer here. I'd suggest trying both and see which one works out better. :-)

hsharrison commented 9 years ago

Alright, thanks for the advice. Sounds like our thoughts on this are similar. I'll give..something...a shot.

hsharrison commented 9 years ago

Still thinking, but putting this link here for posterity as it touches upon all the options we just mentioned: http://stackoverflow.com/questions/441001/possible-to-do-a-mysql-foreign-key-to-one-of-two-possible-tables

BurntSushi / nfldb

projections database add-on #77