Open trevorcampbell opened 3 years ago
Another one. In this case the second set of records has missing / "unknown" entries, but the award ref numbers, tracking numbers, and requester details match.
KORDECK,HERBERT,C,1936,M,WHITE,,1962-02-26,,1990-06-08,8093112,HONORABLE MENTION,2013-12-22,HW502242,DELETED,2013-10-21,2013-10-21,,LARSON,ROBERT,9161,POLICE OFFICER
KORDECK,HERBERT,C,1936,M,WHITE,,1962-02-26,,1990-06-08,8099927,HONORABLE MENTION,2014-12-23,HX-501556,DELETED,2014-11-10,2014-11-10,,PALUCH,PHILIP,9161,POLICE OFFICER
KORDECK,HERBERT,C,1936,M,WHITE,,1962-02-26,,1990-06-08,8106187,DEPARTMENT COMMENDATION,2015-12-10,HY401082,DELETED,2015-08-28,2015-08-28,,SMITH,TIMOTHY,9161,POLICE OFFICER
KORDECK,HERBERT,C,1936,M,WHITE,,1962-02-26,,1990-06-08,8111944,DEPARTMENT COMMENDATION,2016-12-30,HZ512156,DELETED,2016-10-05,2016-11-12,,BARRY,KEVIN,9161,POLICE OFFICER
KORDECK,HERBERT,C,,X,UNKNOWN,,,,,8093112,HONORABLE MENTION,2013-12-22,HW502242,DELETED,2013-10-21,2013-10-21,,LARSON,ROBERT,,
KORDECK,HERBERT,C,,X,UNKNOWN,,,,,8099927,HONORABLE MENTION,2014-12-23,HX-501556,DELETED,2014-11-10,2014-11-10,,PALUCH,PHILIP,,
KORDECK,HERBERT,C,,X,UNKNOWN,,,,,8106187,DEPARTMENT COMMENDATION,2015-12-10,HY401082,DELETED,2015-08-28,2015-08-28,,SMITH,TIMOTHY,,
KORDECK,HERBERT,C,,X,UNKNOWN,,,,,8111944,DEPARTMENT COMMENDATION,2016-12-30,HZ512156,DELETED,2016-10-05,2016-11-12,,BARRY,KEVIN,,
Tough call. On the one hand, I am tempted to say that if there are two different sets of records in the same dataset, then the database believes these are two different officers, and we would be losing information by merging them. On the other hand, given that the lists of "events" match (almost) exactly, it is very tempting to merge them.
Here is a hypothesis: whenever there is an update to an officer's profile in this database (for example to fix/update the race or appointment_date), instead of updating the rows in place, their data system simply duplicates all the entries for this officer in the award table. If this sounds plausible, it would suggest keeping only the second set of records whenever this occurs. What do you think? Really not sure about this, I am as puzzled as you :)
In some sense, this seems a bit similar to the "dead entry" situation in the unit assignment history. For whatever reason, a duplication of officers is introduced in their database, and we should treat the entries before duplication as "dead" entries. We don't know exactly what causes a duplication to occur, but it is easy enough to imagine a bad UI, where instead of clicking "update record", the user clicks a button which somehow creates a new record from an existing one.
instead of updating the rows, their data system simply duplicates all the entries for this officer in the award table
this was essentially my hypothesis too.
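A rough way to test the duplication hypothesis: group the award rows by officer name, split each group by profile, and flag names where every profile carries exactly the same set of award events. This is only a sketch. The column names (`birth_year`, `award_ref_number`, `tracking_no`, etc.) are placeholders I made up for illustration, since the rows above have no header, and it is not what `src/merge_awards.py` currently does.

```python
from collections import defaultdict

# Hypothetical column names; adjust to the real header of the awards file.
NAME_FIELDS = ("last_name", "first_name", "middle_initial")
PROFILE_FIELDS = ("birth_year", "gender", "race", "appointment_date")
EVENT_FIELDS = ("award_ref_number", "tracking_no", "requester_last_name")


def find_duplicated_profiles(rows):
    """Return {name: [profile, ...]} for names whose profiles all share
    exactly the same set of award events."""
    # name -> {profile: set of event keys}; dicts keep first-seen order.
    events = defaultdict(dict)
    for row in rows:
        name = tuple(row[f] for f in NAME_FIELDS)
        profile = tuple(row[f] for f in PROFILE_FIELDS)
        event = tuple(row[f] for f in EVENT_FIELDS)
        events[name].setdefault(profile, set()).add(event)

    suspects = {}
    for name, by_profile in events.items():
        profiles = list(by_profile)
        if len(profiles) < 2:
            continue  # only one profile under this name, nothing to compare
        first = by_profile[profiles[0]]
        if all(by_profile[p] == first for p in profiles[1:]):
            suspects[name] = profiles
    return suspects
```

Requiring identical event sets is deliberately strict. A looser version could accept one set being a subset of the other, which would also catch blocks where one copy has an extra entry.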
OK, in src/merge_awards.py I do a bit of hand-cleaning to avoid obvious issues like the above. But I'm not handling the above pattern in a general way, so let's leave this issue open as a TODO item.
Related: I did a bit of experimentation with handling this in a general way. There are about 140 entries for which there are problems like the above. There are about 750,000 award entries total, so maybe not a huge deal.
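For reference, here is a minimal sketch of one general rule, following the "keep the second set" suggestion earlier in the thread: for each (officer name, award event) key, keep only the last row seen and count how many rows were dropped. The function name and `key_fields` are hypothetical, and it assumes file order reflects the order in which the duplicates were created, which we have not verified.

```python
def keep_latest(rows, key_fields):
    """Keep the last row seen for each key; return (kept_rows, n_dropped).

    Assumption: rows are in the order the source system wrote them, so a
    later row with the same key supersedes an earlier "dead" one.
    """
    latest = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        latest[key] = row  # a later row overwrites the earlier one
    kept = list(latest.values())
    return kept, len(rows) - len(kept)
```

One caveat: in the KORDECK rows above, the later set is the one with the blank / UNKNOWN fields, so this rule would keep the sparser profile. That is worth checking against a few more examples before adopting it.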
I see a few cases in the awards data where records are duplicated up to just a single discrepancy.
@Thibauth what do you think we should do about these? There aren't a huge number of them. For now I am keeping them as separate UIDs, but in these cases I think we might want to do something smarter...
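To find these systematically, something like the following could work: within each name group, compare every pair of rows and report the pairs that differ in exactly one field. This is a sketch under assumptions: rows are dicts sharing the same keys, and `name_fields` is whatever columns identify the name. It is quadratic within a name group, which should be fine as long as no single name has a huge number of award rows.

```python
from collections import defaultdict
from itertools import combinations


def single_discrepancy_pairs(rows, name_fields):
    """Return (row_a, row_b, field) for same-name row pairs that differ
    in exactly one field."""
    by_name = defaultdict(list)
    for row in rows:
        by_name[tuple(row[f] for f in name_fields)].append(row)

    hits = []
    for group in by_name.values():
        for a, b in combinations(group, 2):
            # Fields whose values disagree between the two rows.
            diff = [f for f in a if a[f] != b.get(f)]
            if len(diff) == 1:
                hits.append((a, b, diff[0]))
    return hits
```

Two genuinely different awards for the same officer normally differ in several fields at once (ref number, tracking number, dates), so they are not reported. Only near-copies are.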
Additional info: I am at least somewhat certain these records are from the same DB as the other CPD records (I see similar typos, extra spaces, etc in the names).
Some examples:
BOBBIE HALL here has two groups of duplicate records: one for race = BLACK and one for race = ASIAN/PACIFIC ISLANDER. Otherwise you can see that the tracking numbers, award ref numbers, request dates, etc. all match up. HALL appears nowhere else in any of the data.
Similar situation for DAVID LAVIN, who has two appointment_dates: 2007-07-05 and 1999-11-15, but otherwise identical records.
Another LAVIN block elsewhere in the awards data: here you can see one additional entry for the 2000-appointed LAVIN.
The 1999 LAVIN does appear in the rest of the data (in the unit history, roster, etc); the 2000 LAVIN does not.
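That observation suggests a possible tie-breaker: when a name has several candidate profiles in the awards data, prefer the one that is corroborated by the other files (roster, unit history), and leave the UIDs separate when zero or several candidates are corroborated. A minimal sketch, with a hypothetical function name and profiles represented as plain tuples:

```python
def choose_profile(candidates, known_profiles):
    """Return the single candidate that also appears in known_profiles,
    or None when the evidence is missing or ambiguous.

    `known_profiles` is assumed to be a set of profile tuples built from
    the other datasets (roster, unit history, ...).
    """
    matches = [p for p in candidates if p in known_profiles]
    return matches[0] if len(matches) == 1 else None
```

For LAVIN this would pick the 1999-appointed profile. For HALL, who appears nowhere else, it returns None and the two groups stay as separate UIDs.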