Standardize "candidate" field across elections

NickCrews commented 7 months ago

I am combining this data with the data from 2018 and 2020:

State Precinct-Level Returns 2018: https://doi.org/10.7910/DVN/ZFXEJU
State Precinct-Level Returns 2020: https://doi.org/10.7910/DVN/OKL2K1

After unioning 2018, 2020, and 2022, the encodings for certain candidates are inconsistent:

from ibis import _
t.group_by(["year", "candidate"]).agg(n=_.count()).order_by(_.n.desc())

This gives:

┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ year  ┃ candidate                 ┃ n      ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ int64 │ string                    │ int64  │
├───────┼───────────────────────────┼────────┤
│  2018 │ YES                       │ 995256 │
│  2018 │ NO                        │ 995229 │
│  2022 │ NO                        │ 757102 │
│  2022 │ YES                       │ 748529 │
│  2020 │ YES                       │ 725251 │
│  2020 │ NO                        │ 725001 │
│  2022 │ UNDERVOTES                │ 609217 │
│  2022 │ WRITE-IN                  │ 607881 │
│  2022 │ OVERVOTES                 │ 587875 │
│  2018 │ UNDERVOTES                │ 427220 │
│  2018 │ OVERVOTES                 │ 348737 │
│  2018 │ WRITEIN                   │ 296645 │
│  2020 │ UNDERVOTES                │ 187498 │
│  2020 │ WRITEIN                   │ 172413 │
│  2022 │ OTHER WRITE-INS           │ 169402 │
│  2020 │ OVERVOTES                 │ 117176 │
│  2022 │ WRITEIN                   │  95898 │
│  2022 │ AGAINST                   │  84561 │
│  2022 │ FOR                       │  84405 │
│  2022 │ NULL                      │  83129 │
│  2022 │ UNDER VOTES               │  77326 │
│  2018 │ UNDER VOTES               │  75499 │
│  2018 │ OVER VOTES                │  75499 │
│  2022 │ OVER VOTES                │  74510 │
│  2018 │ THOMAS P DINAPOLI         │  70227 │
│  2018 │ ANDREW M CUOMO            │  56067 │
│  2022 │ BLANK                     │  55152 │
│  2022 │ ANGELA E UNDERWOOD JACOBS │  53301 │
│  2022 │ GAVIN NEWSOM              │  53301 │
│  2022 │ MALIA M COHEN             │  53301 │
│  2022 │ RICARDO LARA              │  53301 │
│  2022 │ LANHEE J CHEN             │  53301 │
│  2022 │ ELENI KOUNALAKIS          │  53301 │
│  2022 │ ROB BONTA                 │  53301 │
│  2022 │ ROBERT HOWELL             │  53301 │
│  2022 │ NATHAN HOCHMAN            │  53301 │
│  2022 │ BRIAN DAHLE               │  53301 │
│  2018 │ MARSHALL TUCK             │  47266 │
│  2018 │ BETTY T YEE               │  47266 │
│  2018 │ RICARDO LARA              │  47266 │
│  2018 │ ELENI KOUNALAKIS          │  47266 │
│  2018 │ STEVEN C BAILEY           │  47266 │
│  2018 │ FIONA MA                  │  47266 │
│  2018 │ KONSTANTINOS RODITIS      │  47266 │
│  2018 │ GAVIN NEWSOM              │  47266 │
│  2018 │ ED HERNANDEZ              │  47266 │
│  2018 │ STEVE POIZNER             │  47266 │
│  2018 │ ALEX PADILLA              │  47266 │
│  2018 │ XAVIER BECERRA            │  47266 │
│  2018 │ JOHN H COX                │  47266 │
│  2018 │ GREG CONLON               │  47266 │
│  2018 │ TONY K THURMOND           │  47266 │
│  2018 │ MARK P MEUSER             │  47266 │
│  2018 │ LETITIA A JAMES           │  46123 │
│  2018 │ MARC MOLINARO             │  42375 │
│  2022 │ SCATTERING                │  41206 │
│  2022 │ LISA ELLIS                │  32012 │
│  2018 │ KEITH WOFFORD             │  31049 │
│  2022 │ LETITIA A JAMES           │  29003 │
│  2022 │ KATHY C HOCHUL            │  28875 │
│  2022 │ CHARLES E SCHUMER         │  28862 │
│  2022 │ LEE ZELDIN                │  28719 │
│  2022 │ MICHAEL HENRY             │  28682 │
│  2022 │ THOMAS P DINAPOLI         │  28407 │
│  2022 │ JOE PINION                │  28401 │
│  2018 │ BLANK BALLOTS             │  28359 │
│  2022 │ PAUL RODRIGUEZ            │  28123 │
│  2018 │ JONATHAN TRICHTER         │  28107 │
│  2022 │ PUBLIC COUNTER            │  27952 │
│  2022 │ ABSENTEE / MILITARY       │  27604 │
│  2022 │ RAPHAEL WARNOCK           │  27165 │
│  2022 │ HERSCHEL JUNIOR WALKER    │  27165 │
│  2018 │ WRITE-IN                  │  27003 │
│  2022 │ AFFIDAVIT                 │  26506 │
│  2018 │ NONE OF THESE CANDIDATES  │  26494 │
│  2018 │ SCATTERED VOTES           │  20397 │
│  2022 │ NONE OF THESE CANDIDATES  │  20270 │
│  2020 │ REPEALED                  │  19180 │
│  2020 │ MAINTAINED                │  19180 │
│  2018 │ BLANK/VOID                │  18261 │
│  2022 │ YES FOR APPROVAL          │  18021 │
│  2022 │ NO FOR REJECTION          │  18021 │
│  2022 │ BLANK BALLOTS             │  16972 │
│  2022 │ KRYSTLE MATTHEWS          │  16183 │
│  2022 │ MARK HAMMOND              │  16074 │
│  2022 │ MAINTAINED                │  16042 │
│  2022 │ REPEALED                  │  16042 │
│  2020 │ FOR                       │  16035 │
│  2020 │ AGAINST                   │  16035 │
│  2022 │ RICHARD ECKSTROM          │  16027 │
│  2022 │ CURTIS LOFTIS             │  16027 │
│  2022 │ TIM SCOTT                 │  16027 │
│  2022 │ ALAN WILSON               │  16027 │
│  2022 │ ELLEN WEAVER              │  16026 │
│  2022 │ HUGH WEATHERS             │  16026 │
│  2022 │ HENRY MCMASTER            │  16026 │
│  2022 │ ROSEMOUNDA PEGGY BUTLER   │  16025 │
│  2022 │ JOE CUNNINGHAM            │  16024 │
│  2022 │ SARAH E WORK              │  16011 │
│  2022 │ CHRIS NELUMS              │  16000 │
│     … │ …                         │      … │
└───────┴───────────────────────────┴────────┘

[ ] YES vs YES FOR APPROVAL
[ ] NO vs NO FOR REJECTION
[ ] WRITEIN vs WRITE-IN
[ ] YES vs FOR
[ ] NO vs AGAINST
[ ] OVERVOTES vs OVER VOTES
[ ] UNDERVOTES vs UNDER VOTES
[ ] OTHER WRITE-INS
[ ] ABSENTEE / MILITARY (this actually looks like a separate issue, with the MODE getting placed in the wrong column?)
[ ] BLANK vs BLANK BALLOTS

Can we combine or make these more consistent between all these years?
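As a rough sketch of what combining these could look like on the user side, the checklist above can be expressed as a lookup table. The canonical spellings chosen here (for example "WRITE-IN" rather than "WRITEIN", and folding "FOR"/"AGAINST" into "YES"/"NO") are assumptions for illustration, not a convention the maintainers have adopted, and `standardize_candidate` is a hypothetical helper name. "OTHER WRITE-INS" and "ABSENTEE / MILITARY" are deliberately left unmapped because it is not clear what they should become.

```python
# Hypothetical variant -> canonical mapping built from the checklist above.
# The canonical spellings are an assumption, not an official MEDSL convention.
CANONICAL = {
    "YES FOR APPROVAL": "YES",
    "FOR": "YES",
    "NO FOR REJECTION": "NO",
    "AGAINST": "NO",
    "WRITEIN": "WRITE-IN",
    "OVER VOTES": "OVERVOTES",
    "UNDER VOTES": "UNDERVOTES",
    "BLANK BALLOTS": "BLANK",
}


def standardize_candidate(value):
    """Return the canonical label for a candidate value, or the cleaned value itself."""
    if value is None:
        # Keep missing values missing rather than inventing a label.
        return None
    # Upper-case and collapse runs of whitespace before the lookup.
    key = " ".join(value.upper().split())
    return CANONICAL.get(key, key)
```

Values that are not in the table, such as actual candidate names, pass through unchanged apart from case and whitespace cleanup, so the function can be applied to the whole column.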

Perhaps it would be useful to have a more explicit encoding for ballot measures? Like a column "kind" that is either "CANDIDATE" or "BALLOT MEASURE"? IDK, maybe overkill. If the existing column is consistent, then users can assume a row with a candidate of "YES" or "NO" means a ballot measure.
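To make the "kind" idea concrete, here is a minimal sketch of how such a column could be derived once the candidate field is consistent. The function name `row_kind` and the exact set of ballot-measure choices are assumptions; values like "MAINTAINED"/"REPEALED" would need to be added or standardized first.

```python
# Assumed set of ballot-measure choices after standardization (hypothetical).
BALLOT_MEASURE_CHOICES = {"YES", "NO"}


def row_kind(candidate):
    """Classify a row as "BALLOT MEASURE" or "CANDIDATE" from its candidate value."""
    if candidate in BALLOT_MEASURE_CHOICES:
        return "BALLOT MEASURE"
    # Everything else, including tallies such as OVERVOTES, falls through here
    # under this two-value scheme.
    return "CANDIDATE"
```

With only two values, rows like "UNDERVOTES" or "WRITE-IN" end up labeled "CANDIDATE", which may be an argument for a third category if this were ever implemented.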

sbaltzmit commented 7 months ago

Thanks very much for the attentive comments and the helpful code snippets and visualization. The reason that candidate names are not completely standardized across elections in our precinct data offerings is part of a larger limitation: joining across years in general is not supported with precinct data. Precincts are very often not the same across years, but it is extremely difficult to tell when they have changed and when they have stayed the same. For further explanation, please see the paper we wrote introducing these datasets:

Importantly, our standardization efforts are aimed at standardizing data within election years, but there is no general way to match precinct-level election results across election years. There is no requirement for a precinct to have the same name from one election to the next, and states may create, abolish, or rename precincts between elections, with or without providing a crosswalk file that matches new precinct names to old precinct names. Worse, precinct boundaries frequently change, so a precinct with the same name in roughly the same location may contain a different population from one election to the next [32]. We therefore emphasize that users of our datasets are not encouraged to join precincts by name across elections, unless they have verified with a separate source (such as the precinct shapefiles maintained by the United States Elections Project) that the name identifies the same precinct in both years and that the geography of the precinct did not change.

In contrast, our county-level and district-level datasets are designed to be consistent over time, because they describe a unit of geography that is the same over time, so we do make sure to standardize properties like candidate names across elections in those datasets.

Now, I should say that it's easy to agree that we could do simple things like always write "WRITE-IN" instead of "WRITEIN" in every election year, to save users from having to remember to type two different things when they're working with two of our different election datasets. This kind of complete standardization is definitely on our to-do list, but we're a small team, and I don't have a specific date when we plan to take it on. However, insofar as the goal is to facilitate legitimate ways of joining precinct data across elections, such joins are an edge case: most ways of joining precinct data across elections require much more data, like national municipal re-precincting records, than we are able to offer.

NickCrews commented 7 months ago

Thanks for the quick responses on all of these. I agree joining between precincts is out of scope. I'm not trying to do that; I'm just looking for the WRITEIN vs WRITE-IN standardization (even within 2022, both of these values appear). Can you please reconsider this much smaller feature? I can even submit a PR that does this if you want.
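For reference, one way to surface this kind of within-year inconsistency is to group values by a key with spaces and punctuation stripped, then report any key that has more than one spelling. This is only a sketch: `find_spelling_variants` is a hypothetical helper, and it assumes the data is available as plain `(year, candidate)` pairs rather than as a table expression.

```python
import re
from collections import defaultdict


def find_spelling_variants(rows):
    """Group (year, candidate) pairs by a punctuation-free key.

    Returns {(year, key): sorted spellings} for keys with more than one spelling.
    """
    groups = defaultdict(set)
    for year, candidate in rows:
        if candidate is None:
            continue
        # Drop everything except letters and digits so that
        # "WRITE-IN" and "WRITEIN" collapse to the same key.
        key = re.sub(r"[^A-Z0-9]", "", candidate.upper())
        groups[(year, key)].add(candidate)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}
```

This only catches variants that differ by spacing, hyphens, or case. Pairs like "YES" vs "FOR" or "BLANK" vs "BLANK BALLOTS" would still need an explicit mapping.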

sbaltzmit commented 7 months ago

Oh absolutely. This is another symptom of the fact that I haven't yet had the time to tackle the cross-state standardization that I'm planning for April-May, and at that stage I absolutely should standardize WRITEIN, YES/NO, etc. I'll re-open with the scope of standardizing across states within 2022. I do agree that it would be good to standardize that kind of thing across years too, but I can't commit to a timeline on that. I can add a note about that into the readme though.

MEDSL / 2022-elections-official

Standardize "candidate" field across elections #9