BurntSushi / nflgame

An API to retrieve and read NFL Game Center JSON data. It can work with real-time data, which can be used for fantasy football.
http://pdoc.burntsushi.net/nflgame
The Unlicense

adding missing data #63

Open BurntSushi opened 10 years ago

BurntSushi commented 10 years ago

@ochawkeye brought this up in #62:

That is unfortunate as it does look like that entire last 14 minutes and 56 seconds of the 4th quarter are missing from this game. For guys like Antonio Brown those are some significant stats that are missing.

In my own script, I have added a section to manually include some of the "missing" data and/or stat corrections that come down from the NFL mid-week. Is there a method for doing this sort of thing for nflgame? Obviously these manual tweaks aren't ideal, but with the community that has been growing I imagine we are getting close to the numbers necessary to make it easier to crowd source this than to have multiple people doing it on their own.

I know that making nflgame 100% accurate hasn't been your highest-priority goal, but this would allow us to get closer to the truth and eliminate some of those gaps that accumulate over the course of a year as well.

BurntSushi commented 10 years ago

There are absolutely no mechanisms in nflgame or nfldb that facilitate adding missing data. It is a hard problem. Finding people to actually do the corrections is only one piece of it.

First and foremost, there are two different types of stat corrections. The first type is what you see mainstream fantasy football web sites employing. They are typically a result of statistical errors made while the game is in progress that are corrected after the fact. For example, maybe a sack was mis-credited or the yardage on a reception was off slightly. These are then reported as, "Chandler Jones had 2 sacks during the game as opposed to 1."

The second type of correction is related to malformed data from the JSON feed. This is the case in this issue, and it seems to be the most common type of error found in nflgame. These errors are not reported as incorrect anywhere, and thus some kind of labor force must catch them manually.

I think that part, at least, is well understood. With a sufficient number of eyes on the data, it seems like this is a problem that could be solved.

However, there is a more sinister problem lurking. Namely, assuming we could get people to correct data, how does this get added to nflgame? Does the solution affect nfldb?

On the surface, it seems like we should just be able to, for example, change Chandler Jones' sack count for a particular game and be done with it. I think it's appropriate to label this sort of correction as an accumulative correction. But it doesn't stop there. Namely, in which play did Chandler Jones get a sack that he wasn't credited for? How many yards was it? Who did he sack? Was it split with someone else? I think we should call this kind of correction a play correction.

nflgame has two ways of viewing statistics: game statistics and play statistics. Game statistics can be corrected only by accumulative corrections, while play statistics can be corrected only by play corrections.
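To make the distinction concrete, the two kinds of corrections could be modeled roughly like this (a sketch only; the class and field names are hypothetical and not part of nflgame's API, and the identifiers in the example are made up):

```python
from dataclasses import dataclass

@dataclass
class AccumulativeCorrection:
    """Adjusts one player's aggregate total for one game,
    e.g. 'Chandler Jones had 2 sacks as opposed to 1'."""
    game_eid: str          # NFL.com game identifier
    player_id: str         # GSIS player identifier
    stat_field: str        # e.g. 'defense_sk', 'receiving_yds'
    corrected_value: float

@dataclass
class PlayCorrection:
    """Replaces the statistical events of one specific play."""
    game_eid: str
    drive_num: int
    play_id: int
    events: list           # corrected (player_id, stat_id, yards) tuples

# An accumulative correction is cheap to express; a play correction
# additionally has to pin down exactly where in the game it belongs.
corr = AccumulativeCorrection('2013112401', '00-0000000', 'defense_sk', 2.0)
```

The asymmetry in the amount of detail each record needs is exactly why play corrections are so much more expensive to discover.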

It is my belief that finding play corrections requires significantly more manual labor. Accumulative corrections, however, are relatively easy. All you need to do is check nflgame's accumulative stats for a player in a single game with some other source. It's easy precisely because that data can be scraped from other web sites. But there is no other source of play-by-play data, and therefore, play corrections really cannot be found by a machine with only a single source of possibly errant data. Therefore, someone has to manually inspect a play-by-play breakdown of a game to find the correction.

One could conceivably ignore play corrections and only apply accumulative corrections to game statistics. But this has two major problems:

  1. There are many (important) statistics that are only available by aggregating play statistics.
  2. nfldb has no concept of game statistics. All statistics in nfldb are play statistics.

BurntSushi commented 10 years ago

One might say that these corrections are necessary if, say, a web site wanted to run a league with data from nflgame or nfldb. The problem is that not all points in fantasy football can be computed from simple aggregate totals. For example, field goals have to be inspected on a case by case basis to determine how long the field goal was. Similarly for the "points allowed" category for defenses (at least, by Yahoo standard scoring).

But of course, there is still value in correcting aggregate totals, on a game-by-game basis, for other categories like sacks, yards, touchdowns, etc.

As I said, I firmly believe that discovering play corrections is prohibitively expensive. That kind of correction really requires a team analyzing game footage and taking down statistics themselves. So that leaves aggregate corrections.

But how are they stored? How do we know when to apply them? These are easy questions to answer if we complicate the API by adding a function like "get game statistics, including corrections, for this player for this game." But that means all other functions would be tainted, since they could potentially return incorrect data. So could we embed the corrections somehow? Sure, for game statistics in nflgame. But those corrections won't propagate to nfldb, and there's no way to guarantee that they will take precedence with existing API calls like combine_max_stats.
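One hypothetical shape for such an opt-in overlay (this is not existing nflgame or nfldb API, just a sketch of the idea): layer corrections over raw per-player game totals at query time, leaving every other code path untouched.

```python
def corrected_game_stats(raw_stats, corrections):
    """Layer accumulative corrections over raw per-player game stats.

    raw_stats:   {player_id: {stat_field: value}}
    corrections: {player_id: {stat_field: corrected_value}}
    Returns a new dict; the raw data is never mutated, so every
    existing code path keeps seeing the uncorrected feed.
    """
    out = {pid: dict(stats) for pid, stats in raw_stats.items()}
    for pid, fixes in corrections.items():
        out.setdefault(pid, {}).update(fixes)
    return out

raw = {'00-0000001': {'defense_sk': 1.0, 'defense_tkl': 5}}
fixes = {'00-0000001': {'defense_sk': 2.0}}
print(corrected_game_stats(raw, fixes))
# {'00-0000001': {'defense_sk': 2.0, 'defense_tkl': 5}}
```

The weakness described above remains, of course: nothing forces callers (or nfldb) through this function, so uncorrected results can still leak out of every other entry point.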

BurntSushi commented 10 years ago

I am fairly stumped by this issue. I'm open to discussing other ideas...

ochawkeye commented 10 years ago

Thanks for taking the time, Andrew.

I intend to write a longer response later tonight, but will make this comment...

On the NFL stat correction side, somehow, Yahoo is determining the specific play in question. Now whether that is someone manually combing through the plays and noting the time or otherwise, I do not know.

BurntSushi commented 10 years ago

@ochawkeye That is interesting, but it only addresses the first kind of stat correction, unfortunately. (Errant recording as opposed to corrupt JSON data.) So someone would still need to comb through the footage.

The big players are either paying through the nose for data from Stats, Inc., or they have the resources to record their own stats. In that case, it's very likely that corrections are reported in terms of play corrections. The problem is, no such thing exists for the JSON data, so we'd have to rely on aggregate corrections detected by comparing with some other source.

gojonesy commented 10 years ago

I'm confused. Is the JSON data not corrected coming over from the source? It seems that the source must correct the data, because their Gameday information is corrected. Is that done manually, after the fact?

I was monitoring this game in the parking lot of a mall on the ESPN Sportscenter app :( on Sunday. I was unsure of the outcome of this game for quite a while because their app was not updating properly. I am deducing that the data for this game was messed up everywhere.

Pardon my confusion and newb-ness. I was under the impression that the data would get corrected, in which case we could re-acquire the JSON data after the correction.

BurntSushi commented 10 years ago

@radicalbestfest See for yourself :-) Link to the source JSON data. You'll notice that the last drive ends at the beginning of the 4th quarter, and there don't appear to be any drives after that. If you scroll all the way to the bottom, you'll see that the current clock time is Q4 14:56.

But the play-by-play on the actual gamecenter page is correct. However, if you dig a little bit, you'll notice that the play-by-play is not using the JSON data that nflgame uses, namely, it appears to be generated server side here. Since the JSON data is incorrect and the previous link has the correct play-by-play breakdown, we can only conclude that there are two different data sources. Or, there is only one data source but the JSON feed is never corrected.

It is my understanding that ESPN uses the same JSON feed for live updates while games are in progress. So that might explain why the ESPN data was wonky too.

We should have zero expectation that a game's JSON feed will ever be corrected.

BurntSushi commented 10 years ago

Another angle to this problem is to use the human readable descriptions of plays, as shown here, as an alternative source to play-by-play data. Then we could compare that with the JSON data. Or even replace the JSON data entirely.

But of course, it is human readable data. I've never really thought about what it would take to parse those descriptions at the granularity offered by the JSON data, but I'm quite certain it is not a trivial task.

On the upside, if we did that, we could have play-by-play data going back to 2001, I think.

BurntSushi commented 10 years ago

Actually, there is XML going back to 1999 that has human readable play descriptions.

gojonesy commented 10 years ago

Great find on the XML source!

The human readable source looks like a promising alternative.

Thanks for the explanation. I looked through the JSON data file that I had locally and noticed the same (obviously).

BurntSushi commented 10 years ago

The human readable source is not a promising alternative, unfortunately. It isn't clear at all to me that proper statistics can be consistently and accurately salvaged from it. It's one of those things where it's easy to nail the simple stuff (which is the majority case), but when plays get complex (interceptions, fumble recovery returns, etc.), parsing them also gets significantly harder. Namely, it isn't as straightforward as parsing a simple template. We'd need some sort of grammar whose productions reasonably approximate any human readable play description. I'm not sure how to derive that grammar.
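To illustrate the gap: a naive regex can handle the simple majority case, assuming descriptions shaped like "T.Brady pass short right to R.Gronkowski for 12 yards", but it falls over as soon as a play gets interesting. (The pattern and both example strings here are illustrative, not a real parser or real feed data.)

```python
import re

# Matches only the simplest completed-pass template; a real parser would
# need a grammar covering laterals, fumbles, penalties, reviews, etc.
SIMPLE_PASS = re.compile(
    r"(?P<passer>[A-Z]\.[A-Za-z'-]+) pass (?:short|deep) "
    r"(?:left|middle|right) to (?P<receiver>[A-Z]\.[A-Za-z'-]+) "
    r"for (?P<yards>-?\d+) yards?"
)

m = SIMPLE_PASS.search('T.Brady pass short right to R.Gronkowski for 12 yards.')
print(m.group('passer'), m.group('receiver'), m.group('yards'))
# T.Brady R.Gronkowski 12

# A messier play falls straight through the template:
assert SIMPLE_PASS.search(
    'J.Flacco pass deep left INTERCEPTED by R.Sherman at SEA 2. '
    'Lateral to E.Thomas for 98 yards, TOUCHDOWN.') is None
```

The second example is exactly the kind of production a hand-rolled template can't anticipate, which is why a proper grammar would be needed.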

gojonesy commented 10 years ago

The gamebook XML file for this game here.

This format should at least make the process a little bit easier. As you said, though, it is not going to be easy to translate. Any idea what the sequence field is referencing?

[{"sequence":2,"clubcode":"MIA","playerName":"M.Thigpen","statId":45,"yards":24}],"00-0028954":[{"sequence":3,"clubcode":"PIT","playerName":"R.Golden","statId":79,"yards":0}],"00-0023001":

Am I correct that this hasn't happened previously?

BurntSushi commented 10 years ago

@radicalbestfest The sequence field, we hypothesize, is used to indicate the order of events that occur in a single play. It's used to sort events in _json_play_events which is used to load the events attribute of each play. It's basically an undocumented feature at this point. I haven't had any real use for it. It is not included in nfldb.

This format should at least make the process a little bit easier.

Well, kind of. It's easier in the sense that we don't have to scrape HTML to get the data. But that was never the real difficulty anyway. :-)

There will have to be some sort of scraping though, I suspect. You'll notice that the XML does not have player GSIS identifiers. They can only be found in the JSON data or on the game summary pages on NFL.com. It'd be a pretty trivial scrape, but a scrape nonetheless.

Am I correct that this hasn't happened previously?

What hasn't happened? Massively malformed JSON data? You would be very wrong. Stats from nflgame can be inaccurate, end of story. See my tests on 2012 data, which are compared against a selected set of aggregate stats from Yahoo.

Most of the time it's just a play missing or an event inside a play missing that results in some missing stats. But there are definitely other JSON files that have massive corruption. There was one that I had to manually fix that actually duplicated plays from a particular drive several times. (It should be in the commit logs.)

One could imagine a heuristic to search the data in nflgame so that it finds anomalous games. e.g., A large time gap between any two plays (in terms of game clock time) might indicate a chunk of missing data. But I haven't written that code.
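That heuristic could be sketched roughly as follows, under the assumption that each play's quarter and remaining clock can be pulled from the feed (the function names are made up, and overtime is ignored for simplicity):

```python
def clock_seconds(quarter, clock):
    """Convert (quarter, 'MM:SS' remaining) into seconds elapsed
    since the opening kickoff, assuming 15-minute quarters."""
    mins, secs = map(int, clock.split(':'))
    return (quarter - 1) * 900 + (900 - (mins * 60 + secs))

def find_gaps(plays, threshold=300):
    """Flag consecutive plays separated by more than `threshold` game
    seconds. `plays` is an ordered list of (quarter, 'MM:SS') tuples;
    the gap from the last play to the end of regulation is checked too."""
    times = [clock_seconds(q, c) for q, c in plays] + [3600]
    labels = list(plays) + ['END_OF_REGULATION']
    return [(labels[i], labels[i + 1])
            for i in range(len(times) - 1)
            if times[i + 1] - times[i] > threshold]

# The PIT-MIA feed stops at Q4 14:56; the missing final ~15 minutes
# show up as one large gap:
print(find_gaps([(3, '2:10'), (4, '14:56')]))
# [((4, '14:56'), 'END_OF_REGULATION')]
```

A 300-second threshold is arbitrary; long TV timeouts or halftime would need special-casing, so this only nominates games for human review rather than proving corruption.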

dfd commented 10 years ago

I thought I would just share my experience here...

Years ago (before I knew Python) I wrote a data scraper in Excel VBA that I still use to this day. It parses the play-by-play data for play type, yards gained, interceptions, fumbles, lost possession, down, distance to go, line of scrimmage, etc. At the end, I run an automated sanity check to make sure that each successive play makes sense given the data of the play before and after it. Without fail, I still need to manually correct 15-30 plays each week. I could maybe fix half those errors with code if I really wanted to, but some plays just have so much going on that it's hard to accurately pull out and aggregate the right numbers.

One thing that's interesting is that CBS Sports clearly had problems with that PIT-MIA game as well, because its play-by-play is in a totally different format for that one. That happens a few times a year for them, so they must have some kind of secondary source.

BurntSushi commented 10 years ago

@dfd Neat. Did you employ any particular techniques to convert the human readable descriptions to structured data? It'd be useful to hear more about your methods!

BurntSushi commented 10 years ago

I just thought of an interesting idea. If we believe that the human readable text descriptions of plays are generally more accurate, perhaps there is a way to use them without translating them into a format that is machine readable. In particular, much of the errant data is due to missing plays.

So is there a way to define a correspondence between text descriptions from the XML (or NFL.com) and plays in nflgame? We could rely on the order of plays and the game times (which is easy to pull out of the text descriptions). Once we establish a correspondence, we could feasibly detect discrepancies between the two sets. Note that this would only be an existence test; each discrepancy would have to be manually inspected by a human. But I think that process is doable.
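That existence test could be sketched like this, assuming each source has already been reduced to a set of (quarter, clock) keys (the function name and sample keys are illustrative):

```python
def missing_play_times(json_plays, xml_plays):
    """Existence check: (quarter, clock) timestamps present in the
    human-readable source but absent from the JSON feed. Each argument
    is a set of (quarter, 'MM:SS') keys; the result goes to a human
    for manual inspection, not automatic repair."""
    return sorted(xml_plays - json_plays)

json_keys = {(4, '15:00'), (4, '14:56')}
xml_keys = {(4, '15:00'), (4, '14:56'), (4, '12:31'), (4, '9:02')}
print(missing_play_times(json_keys, xml_keys))
# [(4, '12:31'), (4, '9:02')]
```

One wrinkle: multiple plays can share a clock time (e.g. an untimed snap after a penalty), so in practice the keys would need a tiebreaker such as order within the quarter.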

This sounds like a nice addition to nflgame for next year...

jaunt7 commented 8 years ago

This guy tried to parse the grammar a while back: http://www.10flow.com/2013/01/10/nfl-play-by-play-data/ (parser at https://github.com/10flow/playbyplay).

jaunt7 commented 8 years ago

For completely missing plays, it might be sufficient to cross-reference the human-readable source for missing times. If a time is recorded there that doesn't exist in the plays, then clearly we are missing plays. Has this continued to be a big problem since the missing end-of-game data that prompted this thread?
As for the game I referenced in #184, the play had a net of 28 yards but didn't credit them to the player (the stat had zero yards). Maybe we could flag games as "needing to be manually looked at" when the net yards are greater than zero but nobody was assigned yards, for certain stat IDs, or ignoring certain descriptions. Over the years, have you accumulated a list of the common errors from the manual fixes you use to run your league?
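That flag could be sketched like this, with field names loosely mirroring the JSON feed's per-play event entries shown earlier (treat all names here as illustrative, not a confirmed schema):

```python
def needs_review(play):
    """Flag a play whose net yardage is nonzero even though its
    per-player stat events credit no yards at all."""
    credited = sum(ev['yards'] for ev in play['events'])
    return play['yards_net'] != 0 and credited == 0

# A gain of 28 net yards that credited nobody:
bad = {'yards_net': 28,
       'events': [{'statId': 45, 'playerName': 'M.Thigpen', 'yards': 0}]}
print(needs_review(bad))  # True
```

As suggested above, a production version would also filter by stat ID, since some legitimate plays (e.g. penalties) move the ball without crediting yards to any player.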

BurntSushi commented 8 years ago

@jaunt7 That sounds like a fine idea. There is almost no end to the number of heuristics one might adopt to suss out bad data. I personally probably won't lead that task, although I might be happy to maintain/mentor it if it interoperates with nfldb nicely.

jaunt7 commented 8 years ago

Of course, you've done enough work on your own. I will let you know if I come up with something worth contributing.