BurntSushi / nfldb

A library to manage and update NFL data in a relational database.
The Unlicense
1.08k stars 264 forks source link

penalty_yds is always positive (in nflgame) #20

Open cptrguru opened 10 years ago

cptrguru commented 10 years ago

I found a data issue that you may/may not be able to fix, depending upon the pull from the feed. The play.penalty_yds data is unsigned, therefore, it does not accurately reflect the yardage gained/lost. All data is a positive integer value.

For example, if Seattle was in possession and had an offensive holding penalty, one would expect the play.penalty_yds to contain -10, yet the actual data contained currently would be 10. If Denver was in possession and there was a defensive encroachment penalty, one would expect the play.penalty_yds to contain 5.

Without properly signed data, it forces a string match within play.description for 'PENALTY on {team}', which may return inaccurate results in the event there were multiple penalties on the play (For example, gsis_id=2009092011, drive_id=4, play_id=836).

The column drive.penalty_yards is accurately signed, but is a composite of all plays within that drive, which does not break down the penalty yards on each particular team, let alone individual plays.

Obviously, this issue is dependent upon what data can be extracted from the data feed, so a fix may not be possible... but, it would be really nice to have! Thanks in advance for taking a look at the issue!

BurntSushi commented 10 years ago

Hmm. I'm looking at the schema for the play table, and it says that the type of penalty_yds is smallint, which is a signed type. (Trivia: PostgreSQL does not have unsigned integers. nfldb adds a usmallint domain which is really just a constraint making sure that column never holds negative integers.)

Aside from that though, you are absolutely right that this is a problem. But it's a problem with the data, unfortunately. Even in nflgame, it looks like there are no plays with negative penalty yards, so that tells me that the source data just reports penalty yards as an absolute value.

There's really only one good way to solve this. nflgame needs to look at plays with penalties and give the penalty_yds field a sign. That way, the data is always correct regardless of where you look at it.

But as you said, this is complex and hard to get right. That's a good reason for it to be in the library. :-)

Anyway, I'll accept this issue. I'm happy to receive a PR (for nflgame---nfldb shouldn't require any modification for this fix other than a complete rebuild of the database), but otherwise, I'll work on it when I get a chance.

cptrguru commented 10 years ago

I've had this issue in the back of my mind all day. One thing that popped into my mind was that making the play.penalty_yds value negative if the possession team is the offending team, yet having a positive value for a defensive penalty, may be problematic... It would force a reference to play.pos_team for negative values in play.penalty_yds, yet an ambiguous join with game.away_team and game_home_team where they are not equal to play.pos_team for finding who was on defense.

I think the best solution would be to add a column to the play table (maybe play.penalty_team) to identify the offending team. This would allow one to do a simple group by to find the total penalties and yards for each team.

Thoughts?

BurntSushi commented 10 years ago

yet an ambiguous join with game.away_team and game_home_team where they are not equal to play.pos_team for finding who was on defense

I think a little SQL trickery is in order. :-)

SELECT
    p.penalty_yds, p.pos_team,
    COALESCE(
      NULLIF(g.home_team, p.pos_team),
      NULLIF(g.away_team, p.pos_team)
    ) AS def_team
FROM play AS p
LEFT JOIN game AS g ON g.gsis_id = p.gsis_id
WHERE p.gsis_id = '2014020200' AND p.penalty_yds > 0;

This outputs:

 penalty_yds | pos_team | def_team 
-------------+----------+----------
          10 | SEA      | DEN
          15 | DEN      | SEA
          10 | SEA      | DEN
          20 | DEN      | SEA
          10 | DEN      | SEA
          15 | DEN      | SEA
          10 | SEA      | DEN
           5 | DEN      | SEA
           5 | DEN      | SEA
          10 | DEN      | SEA
           4 | SEA      | DEN
          10 | SEA      | DEN
          15 | SEA      | DEN
           4 | SEA      | DEN
           5 | SEA      | DEN

I think the best solution would be to add a column to the play table (maybe play.penalty_team) to identify the offending team. This would allow one to do a simple group by to find the total penalties and yards for each team.

Despite the SQL trickery above, I think I have to agree with you to some extent. Although, part of me thinks it might be better to add a column def_team instead. I think it would have greater utility in general. The penalty team could then be determined with a simple CASE expression on the sign of penalty_yds.

Or we could add both columns.

I will say that I very much would like to stick with negative and positive values of penalty_yds regardless of any columns we add. Almost every single statistic is reported with respect to the team with possession of the football. In fact, I think penalty_yds is probably the only one that isn't that way now. (And I hadn't even realized until you filed this bug.)

I'm still kind of chewing on this. I think the negative/positive change is the only thing I feel particularly strong about at the moment, although I suspect adding at least one column will also be part of the solution too.

BurntSushi commented 10 years ago

It's settled then.

Now it's just a simple matter of programming. :-)

cptrguru commented 10 years ago

Great idea for the def_team, but would have to include the penalty_team as well to indicate which team the penalty is on... or would you do negative for offensive and positive for defensive values?

BurntSushi commented 10 years ago

Precisely. For example, CASE WHEN penalty_yds < 0 THEN pos_team WHEN penalty_yds > 0 THEN def_team ELSE NULL END.

cptrguru commented 10 years ago

SELECT pos_team, Sum(CASE WHEN penalty_yds < 0 THEN penalty_yds ELSE NULL END) AS off_penalty_yds, def_team, Sum(CASE WHEN penalty_yds > 0 THEN penalty_yds ELSE NULL END) AS def_penalty_yds FROM play ;

That would work! Fantastic!!!

BTW, I'm a ColdFusion developer and have worked with Oracle, SQLServer, MySQL, etc... but, never with Python / PostgreSql. However, I've been coding for 32 years, so languages are pretty easy to switch between... let me know if you would like some help somewhere down the road or need a 2nd set of eyes on something. Also, pretty decent with statistical analysis if you need a hand in that arena. ;-)

~Robbie

BurntSushi commented 10 years ago

Yup, that looks like it would work to me. Although, one think I should mention is that in Postgres, you actually won't be able to use off_penalty_yds or def_penalty_yds in your WHERE clause. (It can be used in GROUP BY and ORDER BY clauses if I recall right.) So it's a little awkward to use if you're writing out the SQL by hand. That's why I'm still kind of considering also going with your initial suggestion to add a penalty_team column.

BTW, I'm a ColdFusion developer and have worked with Oracle, SQLServer, MySQL, etc... but, never with Python / PostgreSql. However, I've been coding for 32 years, so languages are pretty easy to switch between... let me know if you would like some help somewhere down the road or need a 2nd set of eyes on something. Also, pretty decent with statistical analysis if you need a hand in that arena. ;-)

Thanks! I'll ping this issue when I've pushed something, and I'd definitely appreciate any feedback at that point too. Particularly if you can accomplish what you initially set out to accomplish.

And... ColdFusion you say? I haven't used that in almost 10 years. I was using it back when Macromedia was still running things. Might even have a couple of their books still lying around...

cptrguru commented 10 years ago

He he he... yeah, I'm really enjoying CF right now... every developer tends to teethe on CF then move on. That leaves me to be a "big fish in a small pond" when it comes to contracting, so I tend to get the contracts I want at the price I want. Before long, I'll have to move on (probably Java) because most all sizable businesses are bailing on CF. But, I'll cross that bridge when I get there.

cptrguru commented 10 years ago

...and if you still happen to have an installation of CF on your server:


<cfscript>
    obj     =   {};
    qry     =   {};
    temp    =   {};
    qry.nflPenalties    =   QueryNew('gsis_id,drive_id,play_id,penalty,penalty_yds,penalty_team','varchar,integer,integer,integer,integer,varchar');
    args            =       {};
    args.sql    =       "SELECT ";
    args.sql    &=      "p.gsis_id, ";
    args.sql    &=      "p.drive_id, ";
    args.sql    &=      "p.play_id, ";
    args.sql    &=      "p.penalty, ";
    args.sql    &=      "p.penalty_yds, ";
    args.sql    &=      "Replace(Right(Upper(p.description),Length(p.description) - (Instr(p.description,'PENALTY on') + 9)),'PENALTY ON',Char(167)) AS penalty_str ";
    args.sql    &=  "FROM ";
    args.sql    &=      "nflPlays p ";
    args.sql    &=  "WHERE ";
    args.sql    &=      "p.penalty_yds != 0 ";
    obj.Query           =   New Query();
    qry.penalties   =   obj.Query.execute(sql=args.sql).getResult();
    for
    (
        i.row   =       1;
        i.row   LTE qry.penalties.RecordCount;
        i.row   ++
    )
    {
        QueryAddRow(qry.nflPenalties);
        QuerySetCell(qry.nflPenalties,'gsis_id',qry.penalties.gsis_id[i.row],i.row);
        QuerySetCell(qry.nflPenalties,'drive_id',qry.penalties.drive_id[i.row],i.row);
        QuerySetCell(qry.nflPenalties,'play_id',qry.penalties.play_id[i.row],i.row);
        QuerySetCell(qry.nflPenalties,'penalty',qry.penalties.penalty[i.row],i.row);
        QuerySetCell(qry.nflPenalties,'penalty_yds',qry.penalties.penalty_yds[i.row],i.row);
        for
        (
            i.loc   =       1;
            i.loc   LTE ListLen(qry.penalties.penalty_str[i.row],Chr(167));
            i.loc   ++
        )
        {
            temp.penalty_str    =   Trim(ListGetAt(qry.penalties.penalty_str[i.row],i.loc,Chr(167)));
            if(temp.penalty_str CONTAINS 'DECLINED' == false)
            {
                temp.team   =   Left(temp.penalty_str,3);
                temp.lastChar   =   Asc(Right(temp.team,1));
                if
                (
                    temp.lastChar LT 65
                        ||
                    temp.lastChar GT 90
                )
                {
                    temp.team   =   Left(temp.team,2);
                }
                QuerySetCell(qry.nflPenalties,'penalty_team',temp.team,i.row);
                break;
            }
        }
    }
    WriteDump(qry.nflPenalties);
</cfscript>

This may give you some ideas for the Python code? There are about 20k records created here, so it's a chunk... you might want to narrow down the SQL's WHERE clause to a more reasonable set of data before running. Also note, I've been pulling the data from PostgreSql into MySQL for my own preferences, so the table names are not what you'd expect.

BurntSushi commented 10 years ago

Oh heavens no, I've long since purge CF from my life completely. :-)

You might want to check out nfldb/query.py. I went to great lengths to abstract over SQL writing. It actually tries to avoid JOINs since they're so slow, and instead prefers to run multiple queries dealing with the primary keys only.

The wiki has lots of neat examples that might even get a CF programmer to take a dive into Python. :-)

(But there's nothing too special for penalties, other than standard comparisons on whatever is in the DB.)

cptrguru commented 10 years ago

Joins are only slow if you've got bad SQL, have a Commodore 64, haven't indexed your tables appropriately, or have a crap-o-la database. ;-)

I appreciate the query.py package, but I'm waaaaaaaaaay too old-school to leave my queries to someone else (no offense)... old DBA control-freak issues, ya know!

Yeah, I've started playing around with Python since I found your cool little JSON scrape-o-matic. My first impressions are that it's a good little language with a lot of maturing ahead of it. I think I'd prefer something more complete (such as Perl), but that's just my personal nerd bias. ColdFusion pays the bills for the time being, but I'm realistic in that the future is short lived for the language unless Adobe starts paying more attention to their investment. Java is going to be the big-boy for the foreseeable future because of it's flexibility, power, and dominance in the current web market, and since ColdFusion is just a "wrapper language" for Java, it's the logical step in my career (baring Jennifer Aniston starts knocking on my door).

BurntSushi commented 10 years ago

This is a little trickier than I thought. How would a play like this be resolved?

15 (SF, OWN 31, Q4, 1 and 10) (13:46) C.Kaepernick pass incomplete deep middle to M.Crabtree. PENALTY on TB-W.Gholston, Defensive Offside, 5 yards, enforced at SF 31 - No Play. PENALTY on SF-M.Crabtree, Unsportsmanlike Conduct, 15 yards, enforced between downs.

Notice that there are two PENALTY on ... phrases referring to two different teams.

BrutalSimplicity commented 9 years ago

Regular expression?

In [41]: s = '15 (SF, OWN 31, Q4, 1 and 10) (13:46) C.Kaepernick pass incomplete deep middle to M.Crabtree. PENALTY on TB-W.Gholston, Defensive Offside, 5 yards, enforced at SF 31 - No Play. PENALTY on SF-M.Crabtree, Unsportsmanlike Conduct, 15 yards, enforced between downs.'

In [42]: m = re.findall(r'PENALTY.*?\.', re.sub(r'(\w)\.(\w+)', r'\1*\2', s.upper()))

In [43]: m
Out[43]:
['PENALTY ON TB-W*GHOLSTON, DEFENSIVE OFFSIDE, 5 YARDS, ENFORCED AT SF 31 - NO PLAY.',
 'PENALTY ON SF-M*CRABTREE, UNSPORTSMANLIKE CONDUCT, 15 YARDS, ENFORCED BETWEEN DOWNS.']

In [44]: s = '(8:16) 4-M.Darr punts 60 yards to NE 11, Center-92-J.Denney. 11-J.Edelman pushed ob at NE 22 for 11 yards (42-S.Paysinger). PENALTY on NE-24-R.Melvin, Illegal Block Above the Waist, 10 yards, enforced at NE 20.'

In [45]: m = re.findall(r'PENALTY.*?\.', re.sub(r'(\w)\.(\w+)', r'\1*\2', s.upper()))

In [46]: m
Out[46]: ['PENALTY ON NE-24-R*MELVIN, ILLEGAL BLOCK ABOVE THE WAIST, 10 YARDS, ENFORCED AT NE 20.']

Not sure how this works across all plays though. I would imagine you wouldn't run this unless there was a penalty on the play. So... 'if PENALTY in s.upper(): m = re.findall(r'PENALTY.*?\.', re.sub(r'(\w)\.(\w+)', r'\1*\2', s.upper())). Although you would likely just convert s to upper in the DB. You could also run the reg in the DB I think, but I don't know the syntax.

That's one of the things I don't like about the NFL's play-by-play format. It's rather inconsistent, which makes it difficult to pull information out of. I'm currently scraping pro-reference, and the data is much cleaner, but the information is a bit more difficult to access.

BurntSushi commented 9 years ago

Yeah, I don't really want to spend my time writing regexes for all the various play-by-play formats. :-)

BrutalSimplicity commented 9 years ago

Not really too time-consuming man, you could start here, and then elaborate . I'm working on one for pro-football-reference, and this is where it's at so far. Yea, it's hard to read, but right now I'm converting it into Python. Maybe you can adapt the two sources to suit your needs. I'm sure people would appreciate the additional information that this provides.

/* Built in LinqPad. I was experimenting with Linq at the time, so please forgive the harsh use of the ternary operator. */
var temp = (from game in Game_data
            select new { 
                data = game,
                compare = game.Detail.ToUpper() 
            }).AsEnumerable();

var stats = 
(from game in temp
let pass = game.compare.Contains("PASS")
let passIncomplete = game.compare.Contains("INCOMPLETE")
let passRange = Regex.Match(game.compare, @"(?:COMPLETE|INCOMPLETE) (?<range>DEEP|SHORT)")
let passDirection = Regex.Match(game.compare, @"(?:COMPLETE|INCOMPLETE) (?:\w+) (?<direction>LEFT|MIDDLE|RIGHT)")
let rushDirection = Regex.Match(game.compare, @"(?<direction>RIGHT|LEFT) (?:GUARD|TACKLE|END)|UP THE (?<direction>MIDDLE)")
let rush = (rushDirection.Success) ? true : Regex.Match(game.compare, @"(\w+\s)+FOR").Success
let rushLineman = Regex.Match(game.compare, @"(?:RIGHT|LEFT) (?<lineman>GUARD|TACKLE|END)")
let kickoff = game.compare.Contains("KICKS OFF")
let punt = game.compare.Contains("PUNT")
let yards = Regex.Matches(game.compare, @"(-?\d+) YARDS?")
let isExtraPoint = game.compare.Contains("EXTRA POINT")
let isFieldGoal = game.compare.Contains("FIELD GOAL")
let isFieldGoalNoGood = game.compare.Contains("NO GOOD")
let isTouchback = game.compare.Contains("TOUCHBACK")
let isTouchdown = game.compare.Contains("TOUCHDOWN")
let isOnside = game.compare.Contains("ONSIDE")
let isBlocked = game.compare.Contains("BLOCKED")
let isSack = game.compare.Contains("SACK")
let isInterception = game.compare.Contains("INTERCEPT")
let isPenalty = game.compare.Contains("PENALTY")
let isTimeout = game.compare.Contains("TIMEOUT")
let isTwoPoint = game.compare.Contains("TWO POINT")
let isChallenge = game.compare.Contains("CHALLENGE")
let isOverturned = game.compare.Contains("OVERTURN")
let isKneel = game.compare.Contains("KNEEL")
let isSpike = game.compare.Contains("SPIKE")
let isSafety = game.compare.Contains("SAFETY")
let isNoPlay = game.compare.Contains("NO PLAY") || game.compare.StartsWith("PENALTY ON")
select new 
{ 
PlayID = game.data.Play_id,
PlayType = (isTwoPoint) ? "TWO POINT" :
           (isNoPlay) ? "NOPLAY" :
           (pass || isSack) ? "PASS" : 
           (rush) ? "RUSH" : 
           (kickoff) ? "KICKOFF" :
           (isOnside) ? "ONSIDE" :
           (punt) ? "PUNT" : 
           (isExtraPoint) ? "EXTRA POINT" : 
           (isFieldGoal) ? "FIELD GOAL" : 
           (isKneel) ? "KNEEL" : 
           (isSpike) ? "SPIKE" : "",
PlayResult = (  ((pass) ? 
                (((passIncomplete) ? "INCOMPLETE" : "COMPLETE") +
                 ((passRange.Success) ? "|" + passRange.Groups["range"] : "") +
                 ((passDirection.Success) ? "|" + passDirection.Groups["direction"] : "") +
                 ((isSack) ? "|SACK" : "") +
                 ((isInterception) ? "|INTERCEPTION" : "")) :
                (rush) ? 
                 (((rushDirection.Success) ? "|" + rushDirection.Groups["direction"] : "") +
                  ((rushLineman.Success) ? "|" + rushLineman.Groups["lineman"] : "")) : 
                (kickoff) ?
                 ((isTouchback) ? "TOUCHBACK" : 
                 (isBlocked) ? "BLOCKED" : "") :
                (isFieldGoal) ?
                 ((isFieldGoalNoGood) ? "NO GOOD" : "GOOD") :
                "") +
                ((isTouchdown) ? "|TOUCHDOWN" : 
                 (isSafety) ? "|SAFETY" : "")
             ).Trim('|'),
Yards1 = (yards.Count > 0) ? (int?)int.Parse(yards[0].Groups[1].Value) : (int?)null,
Yards2 = (yards.Count > 1) ? (int?)int.Parse(yards[1].Groups[1].Value) : (int?)null,
IsPenalty = isPenalty,
IsTimeout = isTimeout,
IsTwoPoint = isTwoPoint,
IsChallenge = isChallenge,
IsOverturned = isOverturned,
IsNoPlay = isNoPlay,
Detail = game.data.Detail,
GameID = game.data.Game_id
}).Dump();
BurntSushi commented 9 years ago

And now for the rest of the categories: https://github.com/BurntSushi/nfldb/wiki/Statistical-categories

And IIRC, the plays have some variation as time has moved on. Descriptions are slightly different in 2001 than now. But I haven't looked in a while.

Yes it's time consuming. But hey, if it's no problem, and you write a program that extracts all of the stats that nfldb needs from plays going back to 2001, then you can bask in the eternal love of nfldb users everywhere. :-)

BrutalSimplicity commented 9 years ago

So, you're DB provides all of those categories already, or are those to-do?

BurntSushi commented 9 years ago

Yes. All of them. Because they're available in a structured format via an undocumented JSON feed going back to 2009.

BrutalSimplicity commented 9 years ago

Excellent! Thanks.

BrutalSimplicity commented 9 years ago

Many of those statistics are not in the feed. How are you accomplishing that?

BurntSushi commented 9 years ago

Yes they are. All of the code is open source, I invited you to examine for yourself: https://github.com/BurntSushi/nflgame/blob/master/nflgame/statmap.py and https://github.com/BurntSushi/nflgame/blob/master/nflgame/game.py

BrutalSimplicity commented 9 years ago

Yea, just took a look at those. That is incredible! Awesome job, you and your contributors have done here.

I'll probably borrow some of the code to add to my sqlite DB. I'm a fan of Postgresql, but I'd like to make the information more portable for those who just want to mess with the DB, and don't need an API.