Open trevorcampbell opened 3 years ago
I have thought a lot about this and I disagree. Given that the names are always pulled from the same database internally, there is no reason why someone's name should sometimes come up as O GRADY and sometimes as OGRADY. Since there are of course different officers named O GRADY and OGRADY, I don't want to assume a priori that O GRADY and OGRADY are the same person.
One argument in favor of this is that, in all of my linking scripts, I do consider a relaxed criterion where I don't even include the last name (I match, for example, on first name, birth year, and appointment date), and I saw only extremely few cases where the same officer had the same last name in two slightly different formats. I suspect this only occurs when someone asks for the database to be updated, for example if they notice a typo in their name (the only example that comes to mind is K0NELLY, where the O was written as 0, which did get fixed at some point in their database).
What you are suggesting would make sense if the names were input by hand, which is not the case, and it has the potential to create many false positives (and, as per the argument in the previous paragraph, there would be little to gain, because in practice you almost never see the same officer with a minor variation in the way their name is written).
Just to make it clearer: I believe that all these variants are in fact useful information that helps you disambiguate between officers with common last names that come in many variants, and that by "normalizing" the last names you would introduce a lot of ambiguities, hence doing more harm than good.
I would be willing to change my opinion if you can give me examples where the same officer shows up with many different variants of their last name, but my experience with the data suggests that this would be extremely rare.
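One way to look for such counterexamples is to reuse the relaxed criterion described above: group records on first name, birth year, and appointment date, and list the groups that contain more than one distinct last name. The following is only a minimal sketch; the field names (`first_name`, `last_name`, `birth_year`, `appointment_date`) and the function `last_name_variants` are assumptions for illustration, not the actual schema or code of the linking scripts, and the records are toy data.

```python
from collections import defaultdict


def last_name_variants(records):
    """Return groups that share (first name, birth year, appointment date)
    but contain more than one distinct last name.

    The dictionary keys used here are hypothetical field names.
    """
    groups = defaultdict(set)
    for rec in records:
        key = (rec["first_name"], rec["birth_year"], rec["appointment_date"])
        # Skip records where any key field is missing: they cannot be
        # grouped reliably under the relaxed criterion.
        if any(value in ("", None) for value in key):
            continue
        groups[key].add(rec["last_name"])
    return {key: names for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    # Toy records, invented for illustration only.
    toy_records = [
        {"first_name": "A", "last_name": "KONELLY",
         "birth_year": 1980, "appointment_date": "2005-01-01"},
        {"first_name": "A", "last_name": "K0NELLY",
         "birth_year": 1980, "appointment_date": "2005-01-01"},
        {"first_name": "B", "last_name": "O GRADY",
         "birth_year": 1975, "appointment_date": "2001-06-15"},
    ]
    for key, names in last_name_variants(toy_records).items():
        print(key, sorted(names))
```

Groups flagged this way are only candidates: two different officers can share a first name, birth year, and appointment date, so each hit would still need a manual look.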
OK. What you've said is convincing. But let's leave this open in case I end up finding such a counterexample ;)
@Thibauth one important point that I disagree on:
"Given that the names are always pulled from the same database internally,"
This is not true. Some of our data was from the City of Chicago Dept of Human Resources (e.g. the salary data), which appears to be a different DB than the CPD DB that a lot of other records are pulled from.
I am running into this issue again in the following case:
VELAZQUEZ,ISABEL,,Female,,,,,2017-03-16,,,,,,,,,,,,,,,,,,,,,salary,0c3e3405-7c55-4dbf-8d39-3a0e7398d32d
VELAZQUEZ,ISABEL, ,F,WHITE HISPANIC,1994,22,Y,2017-03-16,9161,POLICE OFFICER,,,,,,,,,,,,,,,,,,,P0-58155,577e43fb-1b63-4b6f-9485-aa1bbb66c588
VELAZQUEZ,ISABEL, ,F,WHITE HISPANIC,,23,,2017-03-16,9161,,044,,,,,,,,,,,,,,Y,27,022,12475,P4-41436,577e43fb-1b63-4b6f-9485-aa1bbb66c588
It seems the MI was entered as a single space in the CPD DB, while in the HR DB it was entered as an empty string.
I can handle this for the salary data, but you may want to look through the other data to make sure you do believe it was all from the same DB.
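For the salary data, the handling could look something like the sketch below: treat a whitespace-only field as equivalent to an empty one before comparing. This assumes the first three columns are last name, first name, and MI (which is how the rows above read); `normalize_blank` and `same_name_fields` are hypothetical helper names, not functions from the repository, and the rows are truncated to their first four fields.

```python
import csv
import io


def normalize_blank(value):
    """Map whitespace-only strings to the empty string; leave others as-is."""
    return "" if value.strip() == "" else value


def same_name_fields(row_a, row_b, n_fields=3):
    """Compare the first n_fields columns after blank normalization.

    Assumes (hypothetically) that these columns are last name,
    first name, and middle initial.
    """
    a = [normalize_blank(v) for v in row_a[:n_fields]]
    b = [normalize_blank(v) for v in row_b[:n_fields]]
    return a == b


# First four fields of the HR (salary) row and of a CPD row from above.
hr_row = next(csv.reader(io.StringIO("VELAZQUEZ,ISABEL,,Female")))
cpd_row = next(csv.reader(io.StringIO("VELAZQUEZ,ISABEL, ,F")))

if __name__ == "__main__":
    print(hr_row[:3] == cpd_row[:3])          # raw comparison
    print(same_name_fields(hr_row, cpd_row))  # comparison after normalization
```

Note that this only touches fields that are entirely blank; it does not strip or alter whitespace inside a name such as O GRADY.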
Thanks a lot. For all the other datasets for which the SQL query was available, I was consistently seeing the same table name, and for the other ones, the similarity in field names doesn't give me any reason to believe this was a separate database.
I could see, however, that HR might be a very different service with a different database, and we need to be very careful there (for example, they might have done their own normalization of names internally and stripped whitespace).
@Thibauth I don't see anywhere in the code where you handle standardization of names prior to attempting to link, e.g.
should all be assumed to match. Similar story for JR/JR. and I II III etc.
Did I miss something obvious or are we not handling different typesetting of names?
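To make the question concrete, here is a rough sketch of what a pre-linking standardization step might look like for the suffix cases (JR vs. JR., I/II/III). It is an illustration under assumptions, not code from the repository: `standardize_name` and the `SUFFIXES` set are hypothetical, and whether such a step should be applied at all is exactly what this issue is asking.

```python
import re

# Hypothetical set of generational suffixes to split off.
SUFFIXES = {"JR", "SR", "I", "II", "III", "IV", "V"}


def standardize_name(raw):
    """Return (base_name, suffix) for a raw last-name string.

    Uppercases, drops periods and commas, collapses runs of whitespace,
    and splits off a trailing suffix token if one is present.
    Internal spaces are kept, so O GRADY and OGRADY stay distinct.
    """
    cleaned = re.sub(r"[.,]", "", raw.upper())
    tokens = cleaned.split()
    suffix = ""
    # Only treat the last token as a suffix if a base name remains.
    if len(tokens) > 1 and tokens[-1] in SUFFIXES:
        suffix = tokens.pop()
    return " ".join(tokens), suffix


if __name__ == "__main__":
    for name in ["SMITH JR.", "SMITH JR", "SMITH  III", "O GRADY", "OGRADY"]:
        print(repr(name), "->", standardize_name(name))
```

One caveat with this design: single-letter tokens such as I or V could also be part of a real name rather than a suffix, so a rule like this can itself introduce false matches.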