ayyubibrahimi / us-post-data

MIT License

handling anonymized records #15

Closed (tarakc02 closed 2 days ago)

tarakc02 commented 2 weeks ago

Should we assign unique IDs to each of these? Example from California:

      person_nbr      full_name                    agcy_name    type  rank start_date    end_date
POST ID Withheld  NAME WITHHELD                  ADELANTO PD  POLICE   CPL 1996-06-24  2002-02-02
POST ID Withheld  NAME WITHHELD        ALAMEDA CO SD/CORONER  POLICE   TRN 1998-10-28  1999-04-28
POST ID Withheld  NAME WITHHELD        ALAMEDA CO SD/CORONER  POLICE   SGT 1999-04-28         NaN
POST ID Withheld  NAME WITHHELD        ALAMEDA CO SD/CORONER  POLICE    LT 1999-04-28         NaN
POST ID Withheld  NAME WITHHELD        ALAMEDA CO SD/CORONER  POLICE  DPTY 2000-07-26  2007-07-13
             ...            ...                          ...     ...   ...        ...         ...
POST ID Withheld  NAME WITHHELD                   YUBA CO SD  POLICE   RII 2000-06-29  2016-05-11
POST ID Withheld  NAME WITHHELD                   YUBA CO SD  POLICE  DPTY 2006-03-06  2014-10-18
POST ID Withheld  NAME WITHHELD                   YUBA CO SD  POLICE  DPTY 2014-10-13         NaN
POST ID Withheld  NAME WITHHELD                   YUBA CO SD  POLICE  DPTY 2021-07-06         NaN
POST ID Withheld  NAME WITHHELD  YUBA COMMUNITY COLL DIST PD  POLICE    PO 2012-06-04  2013-10-04

I ask because steps like grouping/flattening expect the same person_nbr to refer to the same person, and here every withheld row shares one person_nbr value.
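For illustration, the unique-ID option could look roughly like the sketch below. It assumes the data is in a pandas DataFrame with the `person_nbr` column shown above and that withheld rows carry the exact string `POST ID Withheld`. The function name `assign_anon_ids`, the `ANON-` prefix, and the sample ID `A123` are hypothetical and not taken from the repository. Since nothing in the data says which withheld rows belong to the same officer, the sketch gives every withheld row its own ID.

```python
import pandas as pd

# Sentinel seen in the California example; other states may use a different marker.
WITHHELD_ID = "POST ID Withheld"


def assign_anon_ids(df: pd.DataFrame, id_col: str = "person_nbr") -> pd.DataFrame:
    """Return a copy where each withheld row gets its own synthetic id.

    Hypothetical helper: one id per row, because withheld rows cannot be
    linked to each other with the information available.
    """
    out = df.copy()
    mask = out[id_col].eq(WITHHELD_ID)
    n_withheld = int(mask.sum())
    if n_withheld:
        # "ANON-" prefix is an illustrative choice, not a project convention.
        out.loc[mask, id_col] = [f"ANON-{i:06d}" for i in range(1, n_withheld + 1)]
    return out


# Tiny example: two withheld rows and one made-up identified row ("A123").
example = pd.DataFrame(
    {
        "person_nbr": ["POST ID Withheld", "POST ID Withheld", "A123"],
        "full_name": ["NAME WITHHELD", "NAME WITHHELD", "JANE DOE"],
        "agcy_name": ["ADELANTO PD", "YUBA CO SD", "YUBA CO SD"],
    }
)
result = assign_anon_ids(example)
```

With this approach a later group-by on `person_nbr` treats each withheld row as a separate one-row "person", which avoids merging unrelated officers but also cannot reconstruct any career history for them.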

ayyubibrahimi commented 2 weeks ago

Idk. My intuition is to drop them and note this in the README.

tarakc02 commented 2 weeks ago

That seems reasonable to me too. I don't think anonymized records are marked consistently across sites, but as with this one, they're not too hard to find for any given site.
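A minimal sketch of the drop-and-document option, assuming a pandas DataFrame with the `person_nbr` and `full_name` columns from the California example. The function name `drop_anonymized` and its parameters are hypothetical; the marker strings are parameters because other sites may label withheld records differently. It also returns the number of dropped rows so the count can be reported in the README.

```python
import pandas as pd


def drop_anonymized(
    df: pd.DataFrame,
    id_col: str = "person_nbr",
    name_col: str = "full_name",
    withheld_id: str = "POST ID Withheld",   # marker seen in the California data
    withheld_name: str = "NAME WITHHELD",    # marker seen in the California data
) -> tuple[pd.DataFrame, int]:
    """Drop rows whose id or name matches a withheld marker exactly.

    Hypothetical helper: returns the filtered frame and the dropped-row count.
    """
    mask = df[id_col].eq(withheld_id) | df[name_col].eq(withheld_name)
    kept = df.loc[~mask].reset_index(drop=True)
    return kept, int(mask.sum())


# Tiny example: two withheld rows and one made-up identified row ("A123").
example = pd.DataFrame(
    {
        "person_nbr": ["POST ID Withheld", "A123", "POST ID Withheld"],
        "full_name": ["NAME WITHHELD", "JANE DOE", "NAME WITHHELD"],
        "agcy_name": ["ADELANTO PD", "YUBA CO SD", "YUBA CO SD"],
    }
)
kept, n_dropped = drop_anonymized(example)
```

The match is exact, so a variant spelling or stray whitespace in another site's data would not be caught without adjusting the marker arguments.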

tarakc02 commented 2 days ago

The only anonymized records I see like this are in California.