chicago-police-violence / data

Dataset about the personnel, use of force, and complaints in the Chicago Police Department
MIT License
7 stars 0 forks source link

awards duplicated entries #27

Open trevorcampbell opened 3 years ago

trevorcampbell commented 3 years ago

I see a few cases in the awards data where records are duplicated up to just a single discrepancy.

@Thibauth what do you think we should do about these? There aren't a huge number of them. For now I am keeping them as separate UIDs, but in these cases I think we might want to do something smarter...

Additional info: I am at least somewhat certain these records are from the same DB as the other CPD records (I see similar typos, extra spaces, etc in the names).

Some examples:

BOBBIE HALL here has two groups of duplicate records: one for race = BLACK and one for race = ASIAN/PACIFIC ISLANDER. Otherwise you can see that the tracking numbers, award ref numbers, request dates, etc all match up. HALL appears nowhere else in any of the data.

HALL,BOBBIE,,1952,F,ASIAN/PACIFIC ISLANDER,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8076980,HONORABLE MENTION,2011-09-24,HT311379,DELETED,2011-05-24,2011-05-25,,KAPA,CHRISTOPH
HALL,BOBBIE,,1952,F,ASIAN/PACIFIC ISLANDER,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8087668,HONORABLE MENTION,2013-03-02,HW153229,DELETED,2013-02-11,2013-02-11,,MIHAJLOV,ROBERT
HALL,BOBBIE,,1952,F,ASIAN/PACIFIC ISLANDER,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8088191,HONORABLE MENTION,2013-04-03,HW205710,DELETED,2013-03-27,2013-03-27,,MOSQUERA,MARINA
HALL,BOBBIE,,1952,F,ASIAN/PACIFIC ISLANDER,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8090566,HONORABLE MENTION,2013-08-08,HW 225281,DELETED,2013-04-09,2013-04-09,,OCONNOR,DANIEL
HALL,BOBBIE,,1952,F,ASIAN/PACIFIC ISLANDER,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8092388,SUPERINTENDENTS AWARD OF VALOR,2013-11-07,HW274799,DELETED,2013-05-14,2013-05-14,,OCONNOR,DANIEL
HALL,BOBBIE,,1952,F,ASIAN/PACIFIC ISLANDER,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8096489,HONORABLE MENTION,2014-07-04,HX330026,DELETED,2014-07-03,2014-07-03,,CLEARY,THOMAS
HALL,BOBBIE,,1952,F,ASIAN/PACIFIC ISLANDER,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8097747,HONORABLE MENTION,2014-08-16,,DELETED,2014-08-16,2014-08-16,,BUTLER,JAMES
HALL,BOBBIE,,1952,F,ASIAN/PACIFIC ISLANDER,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8097748,HONORABLE MENTION,2014-08-16,,DELETED,2014-08-16,2014-08-16,,BUTLER,JAMES
HALL,BOBBIE,,1952,F,ASIAN/PACIFIC ISLANDER,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8105737,HONORABLE MENTION,2015-11-14,HY437386,DELETED,2015-09-25,2015-09-25,,NOTTOLI,MARK
HALL,BOBBIE,,1952,F,ASIAN/PACIFIC ISLANDER,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8109206,SUPERINTENDENT'S AWARD OF TACTICAL EXCELLENCE,2016-06-30,HZ255575,DELETED,2016-05-07,2016-05-07,,SANCHEZ,NOEL
HALL,BOBBIE,,1952,F,ASIAN/PACIFIC ISLANDER,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8113746,HONORABLE MENTION,2017-04-11,JA196900,DELETED,2017-03-22,2017-03-23,,CARROLL,JOSEPH
HALL,BOBBIE,,1952,F,BLACK,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8076980,HONORABLE MENTION,2011-09-24,HT311379,DELETED,2011-05-24,2011-05-25,,KAPA,CHRISTOPH
HALL,BOBBIE,,1952,F,BLACK,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8087668,HONORABLE MENTION,2013-03-02,HW153229,DELETED,2013-02-11,2013-02-11,,MIHAJLOV,ROBERT
HALL,BOBBIE,,1952,F,BLACK,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8088191,HONORABLE MENTION,2013-04-03,HW205710,DELETED,2013-03-27,2013-03-27,,MOSQUERA,MARINA
HALL,BOBBIE,,1952,F,BLACK,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8090566,HONORABLE MENTION,2013-08-08,HW 225281,DELETED,2013-04-09,2013-04-09,,OCONNOR,DANIEL
HALL,BOBBIE,,1952,F,BLACK,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8092388,SUPERINTENDENTS AWARD OF VALOR,2013-11-07,HW274799,DELETED,2013-05-14,2013-05-14,,OCONNOR,DANIEL
HALL,BOBBIE,,1952,F,BLACK,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8096489,HONORABLE MENTION,2014-07-04,HX330026,DELETED,2014-07-03,2014-07-03,,CLEARY,THOMAS
HALL,BOBBIE,,1952,F,BLACK,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8097747,HONORABLE MENTION,2014-08-16,,DELETED,2014-08-16,2014-08-16,,BUTLER,JAMES
HALL,BOBBIE,,1952,F,BLACK,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8097748,HONORABLE MENTION,2014-08-16,,DELETED,2014-08-16,2014-08-16,,BUTLER,JAMES
HALL,BOBBIE,,1952,F,BLACK,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8105737,HONORABLE MENTION,2015-11-14,HY437386,DELETED,2015-09-25,2015-09-25,,NOTTOLI,MARK
HALL,BOBBIE,,1952,F,BLACK,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8109206,SUPERINTENDENT'S AWARD OF TACTICAL EXCELLENCE,2016-06-30,HZ255575,DELETED,2016-05-07,2016-05-07,,SANCHEZ,NOEL
HALL,BOBBIE,,1952,F,BLACK,,1999-11-02,9111-CROSSING GUARD,,2006-07-09,8113746,HONORABLE MENTION,2017-04-11,JA196900,DELETED,2017-03-22,2017-03-23,,CARROLL,JOSEPH

Similar situation for DAVID LAVIN, who has two appointment_dates: 2007-07-05 and 1999-11-15, but otherwise identical records:

LAVIN,DAVID,P,1967,M,WHITE,,2000-07-05,9161-POLICE OFFICER,,,8053444,PROBLEM SOLVING AWARD,2009-04-19,HR-184256 ,DELETED,2009-03-10,2009-03-26,,LAJEWSKI,ROBERT
LAVIN,DAVID,P,1967,M,WHITE,,2000-07-05,9161-POLICE OFFICER,,,8068541,OUTSIDE GOVERNMENTAL AGENCY AWARD,2010-11-07,HS 475331,DELETED,2010-08-22,2010-08-23,,VACLAVIK,JOSEPH
LAVIN,DAVID,P,1967,M,WHITE,,1999-11-15,9161-POLICE OFFICER,,2002-01-04,8053444,PROBLEM SOLVING AWARD,2009-04-19,HR-184256 ,DELETED,2009-03-10,2009-03-26,,LAJEWSKI,ROBERT
LAVIN,DAVID,P,1967,M,WHITE,,1999-11-15,9161-POLICE OFFICER,,2002-01-04,8068541,OUTSIDE GOVERNMENTAL AGENCY AWARD,2010-11-07,HS 475331,DELETED,2010-08-22,2010-08-23,,VACLAVIK,JOSEPH

another LAVIN block elsewhere in the awards data: here you can see one additional entry for the 2000-appointed LAVIN.

LAVIN,DAVID,P,1967,M,WHITE,,2000-07-05,9161-POLICE OFFICER,,,8074708,COMPLIMENTARY LETTER,2011-06-14,,FINAL,2010-01-12,,,JOHNSON,PRISCILLA
LAVIN,DAVID,P,1967,M,WHITE,,2000-07-05,9161-POLICE OFFICER,,,8075036,COMPLIMENTARY LETTER,2011-06-30,,FINAL,2011-06-13,,,HOOVER,TRACY
LAVIN,DAVID,P,1967,M,WHITE,,2000-07-05,9161-POLICE OFFICER,,,8077259,COMPLIMENTARY LETTER,2011-10-05,,FINAL,2011-08-30,,,HOOVER,TRACY
LAVIN,DAVID,P,1967,M,WHITE,,2000-07-05,9161-POLICE OFFICER,,,8080662,COMPLIMENTARY LETTER,2012-03-15,,FINAL,2012-01-18,,,JOHNSON,PRISCILLA
LAVIN,DAVID,P,1967,M,WHITE,,2000-07-05,9161-POLICE OFFICER,,,8084268,HONORABLE MENTION,2012-10-16,HV409998,FINAL,2012-08-01,2012-08-01,,PIKOWSKI,RANDY
LAVIN,DAVID,P,1967,M,WHITE,,1999-11-15,9161-POLICE OFFICER,,2002-01-04,8074708,COMPLIMENTARY LETTER,2011-06-14,,FINAL,2010-01-12,,,JOHNSON,PRISCILLA
LAVIN,DAVID,P,1967,M,WHITE,,1999-11-15,9161-POLICE OFFICER,,2002-01-04,8075036,COMPLIMENTARY LETTER,2011-06-30,,FINAL,2011-06-13,,,HOOVER,TRACY
LAVIN,DAVID,P,1967,M,WHITE,,1999-11-15,9161-POLICE OFFICER,,2002-01-04,8077259,COMPLIMENTARY LETTER,2011-10-05,,FINAL,2011-08-30,,,HOOVER,TRACY
LAVIN,DAVID,P,1967,M,WHITE,,1999-11-15,9161-POLICE OFFICER,,2002-01-04,8080662,COMPLIMENTARY LETTER,2012-03-15,,FINAL,2012-01-18,,,JOHNSON,PRISCILLA

The 1999 LAVIN does appear in the rest of the data (in the unit history, roster, etc); the 2000 LAVIN does not.

trevorcampbell commented 3 years ago

Another one. This one there are missing / "unknown" entries, but the award ref numbers, tracking numbers, and requester details match.

KORDECK,HERBERT,C,1936,M,WHITE,,1962-02-26,,1990-06-08,8093112,HONORABLE MENTION,2013-12-22,HW502242,DELETED,2013-10-21,2013-10-21,,LARSON,ROBERT,9161,POLICE OFFICER
KORDECK,HERBERT,C,1936,M,WHITE,,1962-02-26,,1990-06-08,8099927,HONORABLE MENTION,2014-12-23,HX-501556,DELETED,2014-11-10,2014-11-10,,PALUCH,PHILIP,9161,POLICE OFFICER
KORDECK,HERBERT,C,1936,M,WHITE,,1962-02-26,,1990-06-08,8106187,DEPARTMENT COMMENDATION,2015-12-10,HY401082,DELETED,2015-08-28,2015-08-28,,SMITH,TIMOTHY,9161,POLICE OFFICER
KORDECK,HERBERT,C,1936,M,WHITE,,1962-02-26,,1990-06-08,8111944,DEPARTMENT COMMENDATION,2016-12-30,HZ512156,DELETED,2016-10-05,2016-11-12,,BARRY,KEVIN,9161,POLICE OFFICER
KORDECK,HERBERT,C,,X,UNKNOWN,,,,,8093112,HONORABLE MENTION,2013-12-22,HW502242,DELETED,2013-10-21,2013-10-21,,LARSON,ROBERT,,
KORDECK,HERBERT,C,,X,UNKNOWN,,,,,8099927,HONORABLE MENTION,2014-12-23,HX-501556,DELETED,2014-11-10,2014-11-10,,PALUCH,PHILIP,,
KORDECK,HERBERT,C,,X,UNKNOWN,,,,,8106187,DEPARTMENT COMMENDATION,2015-12-10,HY401082,DELETED,2015-08-28,2015-08-28,,SMITH,TIMOTHY,,
KORDECK,HERBERT,C,,X,UNKNOWN,,,,,8111944,DEPARTMENT COMMENDATION,2016-12-30,HZ512156,DELETED,2016-10-05,2016-11-12,,BARRY,KEVIN,,
Thibauth commented 3 years ago

Tough call. On the one hand I am tempted to say that if there are two different sets of records in the same dataset, then this means that the database believes that these are two different officers and it seems that we would be loosing information by merging them. On the other hand, given that the list of "events" match (almost) exactly it is very tempting to merge them.

Here is a hypothesis: whenever there is an update to an officer's profile in this database (for example to fix/update the race or appointment_date), instead of updating the rows, their data systems simply duplicates all the entries for this officer in the award table? If this sounds plausible, this would suggest only keeping the second set of records whenever this occurs. What do you think? Really not sure about this, I am as puzzled as you :)

In some sense, this seems a bit similar to the "dead entry" situation in the unit assignment history. For whatever reason, a duplication of officers is introduced in their database, and we should treat the entries before duplication as "dead" entries. We don't know exactly what causes a duplication to occur, but it is easy enough to imagine a bad UI, where instead of clicking "update record", the user clicks a button which somehow creates a new record from an existing one.

trevorcampbell commented 3 years ago

nstead of updating the rows, their data systems simply duplicates all the entries for this officer in the award table

this was essentially my hypothesis too.

OK, in src/merge_awards.py I do a bit of hand-cleaning to avoid obvious issues like the above. But I'm not handling the above pattern in a general way, so let's leave this issue open as a TODO item.

trevorcampbell commented 3 years ago

Related: I did a bit of experimentation with handling this in a general way. There are about 140 entries for which there are problems like the above. There are about 750,000 award entries total, so maybe not a huge deal.