Open cptrguru opened 10 years ago
Hmm. I'm looking at the schema for the play
table, and it says that the type of penalty_yds
is smallint
, which is a signed type. (Trivia: PostgreSQL does not have unsigned integers. nfldb adds a usmallint
domain which is really just a constraint making sure that column never holds negative integers.)
Aside from that though, you are absolutely right that this is a problem. But it's a problem with the data, unfortunately. Even in nflgame
, it looks like there are no plays with negative penalty yards, so that tells me that the source data just reports penalty yards as an absolute value.
There's really only one good way to solve this. nflgame
needs to look at plays with penalties and give the penalty_yds
field a sign. That way, the data is always correct regardless of where you look at it.
But as you said, this is complex and hard to get right. That's a good reason for it to be in the library. :-)
Anyway, I'll accept this issue. I'm happy to receive a PR (for nflgame
---nfldb
shouldn't require any modification for this fix other than a complete rebuild of the database), but otherwise, I'll work on it when I get a chance.
I've had this issue in the back of my mind all day. One thing that popped into my mind was that making the play.penalty_yds value negative if the possession team is the offending team, yet having a positive value for a defensive penalty, may be problematic... It would force a reference to play.pos_team for negative values in play.penalty_yds, yet an ambiguous join with game.away_team and game_home_team where they are not equal to play.pos_team for finding who was on defense.
I think the best solution would be to add a column to the play table (maybe play.penalty_team) to identify the offending team. This would allow one to do a simple group by to find the total penalties and yards for each team.
Thoughts?
yet an ambiguous join with game.away_team and game_home_team where they are not equal to play.pos_team for finding who was on defense
I think a little SQL trickery is in order. :-)
SELECT
p.penalty_yds, p.pos_team,
COALESCE(
NULLIF(g.home_team, p.pos_team),
NULLIF(g.away_team, p.pos_team)
) AS def_team
FROM play AS p
LEFT JOIN game AS g ON g.gsis_id = p.gsis_id
WHERE p.gsis_id = '2014020200' AND p.penalty_yds > 0;
This outputs:
penalty_yds | pos_team | def_team
-------------+----------+----------
10 | SEA | DEN
15 | DEN | SEA
10 | SEA | DEN
20 | DEN | SEA
10 | DEN | SEA
15 | DEN | SEA
10 | SEA | DEN
5 | DEN | SEA
5 | DEN | SEA
10 | DEN | SEA
4 | SEA | DEN
10 | SEA | DEN
15 | SEA | DEN
4 | SEA | DEN
5 | SEA | DEN
I think the best solution would be to add a column to the play table (maybe play.penalty_team) to identify the offending team. This would allow one to do a simple group by to find the total penalties and yards for each team.
Despite the SQL trickery above, I think I have to agree with you to some extent. Although, part of me thinks it might be better to add a column def_team
instead. I think it would have greater utility in general. The penalty team could then be determined with a simple CASE
expression on the sign of penalty_yds
.
Or we could add both columns.
I will say that I very much would like to stick with negative and positive values of penalty_yds
regardless of any columns we add. Almost every single statistic is reported with respect to the team with possession of the football. In fact, I think penalty_yds
is probably the only one that isn't that way now. (And I hadn't even realized until you filed this bug.)
I'm still kind of chewing on this. I think the negative/positive change is the only thing I feel particularly strong about at the moment, although I suspect adding at least one column will also be part of the solution too.
It's settled then.
Now it's just a simple matter of programming. :-)
Great idea for the def_team, but would have to include the penalty_team as well to indicate which team the penalty is on... or would you do negative for offensive and positive for defensive values?
Precisely. For example, CASE WHEN penalty_yds < 0 THEN pos_team WHEN penalty_yds > 0 THEN def_team ELSE NULL END
.
SELECT pos_team, Sum(CASE WHEN penalty_yds < 0 THEN penalty_yds ELSE NULL END) AS off_penalty_yds, def_team, Sum(CASE WHEN penalty_yds > 0 THEN penalty_yds ELSE NULL END) AS def_penalty_yds FROM play ;
That would work! Fantastic!!!
BTW, I'm a ColdFusion developer and have worked with Oracle, SQLServer, MySQL, etc... but, never with Python / PostgreSql. However, I've been coding for 32 years, so languages are pretty easy to switch between... let me know if you would like some help somewhere down the road or need a 2nd set of eyes on something. Also, pretty decent with statistical analysis if you need a hand in that arena. ;-)
~Robbie
Yup, that looks like it would work to me. Although, one think I should mention is that in Postgres, you actually won't be able to use off_penalty_yds
or def_penalty_yds
in your WHERE
clause. (It can be used in GROUP BY
and ORDER BY
clauses if I recall right.) So it's a little awkward to use if you're writing out the SQL by hand. That's why I'm still kind of considering also going with your initial suggestion to add a penalty_team
column.
BTW, I'm a ColdFusion developer and have worked with Oracle, SQLServer, MySQL, etc... but, never with Python / PostgreSql. However, I've been coding for 32 years, so languages are pretty easy to switch between... let me know if you would like some help somewhere down the road or need a 2nd set of eyes on something. Also, pretty decent with statistical analysis if you need a hand in that arena. ;-)
Thanks! I'll ping this issue when I've pushed something, and I'd definitely appreciate any feedback at that point too. Particularly if you can accomplish what you initially set out to accomplish.
And... ColdFusion you say? I haven't used that in almost 10 years. I was using it back when Macromedia was still running things. Might even have a couple of their books still lying around...
He he he... yeah, I'm really enjoying CF right now... every developer tends to teethe on CF then move on. That leaves me to be a "big fish in a small pond" when it comes to contracting, so I tend to get the contracts I want at the price I want. Before long, I'll have to move on (probably Java) because most all sizable businesses are bailing on CF. But, I'll cross that bridge when I get there.
...and if you still happen to have an installation of CF on your server:
<cfscript>
obj = {};
qry = {};
temp = {};
qry.nflPenalties = QueryNew('gsis_id,drive_id,play_id,penalty,penalty_yds,penalty_team','varchar,integer,integer,integer,integer,varchar');
args = {};
args.sql = "SELECT ";
args.sql &= "p.gsis_id, ";
args.sql &= "p.drive_id, ";
args.sql &= "p.play_id, ";
args.sql &= "p.penalty, ";
args.sql &= "p.penalty_yds, ";
args.sql &= "Replace(Right(Upper(p.description),Length(p.description) - (Instr(p.description,'PENALTY on') + 9)),'PENALTY ON',Char(167)) AS penalty_str ";
args.sql &= "FROM ";
args.sql &= "nflPlays p ";
args.sql &= "WHERE ";
args.sql &= "p.penalty_yds != 0 ";
obj.Query = New Query();
qry.penalties = obj.Query.execute(sql=args.sql).getResult();
for
(
i.row = 1;
i.row LTE qry.penalties.RecordCount;
i.row ++
)
{
QueryAddRow(qry.nflPenalties);
QuerySetCell(qry.nflPenalties,'gsis_id',qry.penalties.gsis_id[i.row],i.row);
QuerySetCell(qry.nflPenalties,'drive_id',qry.penalties.drive_id[i.row],i.row);
QuerySetCell(qry.nflPenalties,'play_id',qry.penalties.play_id[i.row],i.row);
QuerySetCell(qry.nflPenalties,'penalty',qry.penalties.penalty[i.row],i.row);
QuerySetCell(qry.nflPenalties,'penalty_yds',qry.penalties.penalty_yds[i.row],i.row);
for
(
i.loc = 1;
i.loc LTE ListLen(qry.penalties.penalty_str[i.row],Chr(167));
i.loc ++
)
{
temp.penalty_str = Trim(ListGetAt(qry.penalties.penalty_str[i.row],i.loc,Chr(167)));
if(temp.penalty_str CONTAINS 'DECLINED' == false)
{
temp.team = Left(temp.penalty_str,3);
temp.lastChar = Asc(Right(temp.team,1));
if
(
temp.lastChar LT 65
||
temp.lastChar GT 90
)
{
temp.team = Left(temp.team,2);
}
QuerySetCell(qry.nflPenalties,'penalty_team',temp.team,i.row);
break;
}
}
}
WriteDump(qry.nflPenalties);
</cfscript>
This may give you some ideas for the Python code? There are about 20k records created here, so it's a chunk... you might want to narrow down the SQL's WHERE clause to a more reasonable set of data before running. Also note, I've been pulling the data from PostgreSql into MySQL for my own preferences, so the table names are not what you'd expect.
Oh heavens no, I've long since purge CF from my life completely. :-)
You might want to check out nfldb/query.py
. I went to great lengths to abstract over SQL writing. It actually tries to avoid JOINs since they're so slow, and instead prefers to run multiple queries dealing with the primary keys only.
The wiki has lots of neat examples that might even get a CF programmer to take a dive into Python. :-)
(But there's nothing too special for penalties, other than standard comparisons on whatever is in the DB.)
Joins are only slow if you've got bad SQL, have a Commodore 64, haven't indexed your tables appropriately, or have a crap-o-la database. ;-)
I appreciate the query.py package, but I'm waaaaaaaaaay too old-school to leave my queries to someone else (no offense)... old DBA control-freak issues, ya know!
Yeah, I've started playing around with Python since I found your cool little JSON scrape-o-matic. My first impressions are that it's a good little language with a lot of maturing ahead of it. I think I'd prefer something more complete (such as Perl), but that's just my personal nerd bias. ColdFusion pays the bills for the time being, but I'm realistic in that the future is short lived for the language unless Adobe starts paying more attention to their investment. Java is going to be the big-boy for the foreseeable future because of it's flexibility, power, and dominance in the current web market, and since ColdFusion is just a "wrapper language" for Java, it's the logical step in my career (baring Jennifer Aniston starts knocking on my door).
This is a little trickier than I thought. How would a play like this be resolved?
15 (SF, OWN 31, Q4, 1 and 10) (13:46) C.Kaepernick pass incomplete deep middle to M.Crabtree. PENALTY on TB-W.Gholston, Defensive Offside, 5 yards, enforced at SF 31 - No Play. PENALTY on SF-M.Crabtree, Unsportsmanlike Conduct, 15 yards, enforced between downs.
Notice that there are two PENALTY on ...
phrases referring to two different teams.
Regular expression?
In [41]: s = '15 (SF, OWN 31, Q4, 1 and 10) (13:46) C.Kaepernick pass incomplete deep middle to M.Crabtree. PENALTY on TB-W.Gholston, Defensive Offside, 5 yards, enforced at SF 31 - No Play. PENALTY on SF-M.Crabtree, Unsportsmanlike Conduct, 15 yards, enforced between downs.'
In [42]: m = re.findall(r'PENALTY.*?\.', re.sub(r'(\w)\.(\w+)', r'\1*\2', s.upper()))
In [43]: m
Out[43]:
['PENALTY ON TB-W*GHOLSTON, DEFENSIVE OFFSIDE, 5 YARDS, ENFORCED AT SF 31 - NO PLAY.',
'PENALTY ON SF-M*CRABTREE, UNSPORTSMANLIKE CONDUCT, 15 YARDS, ENFORCED BETWEEN DOWNS.']
In [44]: s = '(8:16) 4-M.Darr punts 60 yards to NE 11, Center-92-J.Denney. 11-J.Edelman pushed ob at NE 22 for 11 yards (42-S.Paysinger). PENALTY on NE-24-R.Melvin, Illegal Block Above the Waist, 10 yards, enforced at NE 20.'
In [45]: m = re.findall(r'PENALTY.*?\.', re.sub(r'(\w)\.(\w+)', r'\1*\2', s.upper()))
In [46]: m
Out[46]: ['PENALTY ON NE-24-R*MELVIN, ILLEGAL BLOCK ABOVE THE WAIST, 10 YARDS, ENFORCED AT NE 20.']
Not sure how this works across all plays though. I would imagine you wouldn't run this unless there was a penalty on the play. So... 'if PENALTY in s.upper(): m = re.findall(r'PENALTY.*?\.', re.sub(r'(\w)\.(\w+)', r'\1*\2', s.upper()))
. Although you would likely just convert s
to upper in the DB. You could also run the reg in the DB I think, but I don't know the syntax.
That's one of the things I don't like about the NFL's play-by-play format. It's rather inconsistent, which makes it difficult to pull information out of. I'm currently scraping pro-reference, and the data is much cleaner, but the information is a bit more difficult to access.
Yeah, I don't really want to spend my time writing regexes for all the various play-by-play formats. :-)
Not really too time-consuming man, you could start here, and then elaborate . I'm working on one for pro-football-reference, and this is where it's at so far. Yea, it's hard to read, but right now I'm converting it into Python. Maybe you can adapt the two sources to suit your needs. I'm sure people would appreciate the additional information that this provides.
/* Built in LinqPad. I was experimenting with Linq at the time, so please forgive the harsh use of the ternary operator. */
var temp = (from game in Game_data
select new {
data = game,
compare = game.Detail.ToUpper()
}).AsEnumerable();
var stats =
(from game in temp
let pass = game.compare.Contains("PASS")
let passIncomplete = game.compare.Contains("INCOMPLETE")
let passRange = Regex.Match(game.compare, @"(?:COMPLETE|INCOMPLETE) (?<range>DEEP|SHORT)")
let passDirection = Regex.Match(game.compare, @"(?:COMPLETE|INCOMPLETE) (?:\w+) (?<direction>LEFT|MIDDLE|RIGHT)")
let rushDirection = Regex.Match(game.compare, @"(?<direction>RIGHT|LEFT) (?:GUARD|TACKLE|END)|UP THE (?<direction>MIDDLE)")
let rush = (rushDirection.Success) ? true : Regex.Match(game.compare, @"(\w+\s)+FOR").Success
let rushLineman = Regex.Match(game.compare, @"(?:RIGHT|LEFT) (?<lineman>GUARD|TACKLE|END)")
let kickoff = game.compare.Contains("KICKS OFF")
let punt = game.compare.Contains("PUNT")
let yards = Regex.Matches(game.compare, @"(-?\d+) YARDS?")
let isExtraPoint = game.compare.Contains("EXTRA POINT")
let isFieldGoal = game.compare.Contains("FIELD GOAL")
let isFieldGoalNoGood = game.compare.Contains("NO GOOD")
let isTouchback = game.compare.Contains("TOUCHBACK")
let isTouchdown = game.compare.Contains("TOUCHDOWN")
let isOnside = game.compare.Contains("ONSIDE")
let isBlocked = game.compare.Contains("BLOCKED")
let isSack = game.compare.Contains("SACK")
let isInterception = game.compare.Contains("INTERCEPT")
let isPenalty = game.compare.Contains("PENALTY")
let isTimeout = game.compare.Contains("TIMEOUT")
let isTwoPoint = game.compare.Contains("TWO POINT")
let isChallenge = game.compare.Contains("CHALLENGE")
let isOverturned = game.compare.Contains("OVERTURN")
let isKneel = game.compare.Contains("KNEEL")
let isSpike = game.compare.Contains("SPIKE")
let isSafety = game.compare.Contains("SAFETY")
let isNoPlay = game.compare.Contains("NO PLAY") || game.compare.StartsWith("PENALTY ON")
select new
{
PlayID = game.data.Play_id,
PlayType = (isTwoPoint) ? "TWO POINT" :
(isNoPlay) ? "NOPLAY" :
(pass || isSack) ? "PASS" :
(rush) ? "RUSH" :
(kickoff) ? "KICKOFF" :
(isOnside) ? "ONSIDE" :
(punt) ? "PUNT" :
(isExtraPoint) ? "EXTRA POINT" :
(isFieldGoal) ? "FIELD GOAL" :
(isKneel) ? "KNEEL" :
(isSpike) ? "SPIKE" : "",
PlayResult = ( ((pass) ?
(((passIncomplete) ? "INCOMPLETE" : "COMPLETE") +
((passRange.Success) ? "|" + passRange.Groups["range"] : "") +
((passDirection.Success) ? "|" + passDirection.Groups["direction"] : "") +
((isSack) ? "|SACK" : "") +
((isInterception) ? "|INTERCEPTION" : "")) :
(rush) ?
(((rushDirection.Success) ? "|" + rushDirection.Groups["direction"] : "") +
((rushLineman.Success) ? "|" + rushLineman.Groups["lineman"] : "")) :
(kickoff) ?
((isTouchback) ? "TOUCHBACK" :
(isBlocked) ? "BLOCKED" : "") :
(isFieldGoal) ?
((isFieldGoalNoGood) ? "NO GOOD" : "GOOD") :
"") +
((isTouchdown) ? "|TOUCHDOWN" :
(isSafety) ? "|SAFETY" : "")
).Trim('|'),
Yards1 = (yards.Count > 0) ? (int?)int.Parse(yards[0].Groups[1].Value) : (int?)null,
Yards2 = (yards.Count > 1) ? (int?)int.Parse(yards[1].Groups[1].Value) : (int?)null,
IsPenalty = isPenalty,
IsTimeout = isTimeout,
IsTwoPoint = isTwoPoint,
IsChallenge = isChallenge,
IsOverturned = isOverturned,
IsNoPlay = isNoPlay,
Detail = game.data.Detail,
GameID = game.data.Game_id
}).Dump();
And now for the rest of the categories: https://github.com/BurntSushi/nfldb/wiki/Statistical-categories
And IIRC, the plays have some variation as time has moved on. Descriptions are slightly different in 2001 than now. But I haven't looked in a while.
Yes it's time consuming. But hey, if it's no problem, and you write a program that extracts all of the stats that nfldb
needs from plays going back to 2001, then you can bask in the eternal love of nfldb
users everywhere. :-)
So, you're DB provides all of those categories already, or are those to-do?
Yes. All of them. Because they're available in a structured format via an undocumented JSON feed going back to 2009.
Excellent! Thanks.
Many of those statistics are not in the feed. How are you accomplishing that?
Yes they are. All of the code is open source, I invited you to examine for yourself: https://github.com/BurntSushi/nflgame/blob/master/nflgame/statmap.py and https://github.com/BurntSushi/nflgame/blob/master/nflgame/game.py
Yea, just took a look at those. That is incredible! Awesome job, you and your contributors have done here.
I'll probably borrow some of the code to add to my sqlite DB. I'm a fan of Postgresql, but I'd like to make the information more portable for those who just want to mess with the DB, and don't need an API.
I found a data issue that you may/may not be able to fix, depending upon the pull from the feed. The play.penalty_yds data is unsigned, therefore, it does not accurately reflect the yardage gained/lost. All data is a positive integer value.
For example, if Seattle was in possession and had an offensive holding penalty, one would expect the play.penalty_yds to contain -10, yet the actual data contained currently would be 10. If Denver was in possession and there was a defensive encroachment penalty, one would expect the play.penalty_yds to contain 5.
Without properly signed data, it forces a string match within play.description for 'PENALTY on {team}', which may return inaccurate results in the event there were multiple penalties on the play (For example, gsis_id=2009092011, drive_id=4, play_id=836).
The column drive.penalty_yards is accurately signed, but is a composite of all plays within that drive, which does not break down the penalty yards on each particular team, let alone individual plays.
Obviously, this issue is dependent upon what data can be extracted from the data feed, so a fix may not be possible... but, it would be really nice to have! Thanks in advance for taking a look at the issue!