MEDSL / 2022-elections-official

Official returns for the 2022 Midterm Elections

Many rows are duplicated except for the "votes" column #15

Closed NickCrews closed 8 months ago

NickCrews commented 8 months ago
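from pathlib import Path

import ibis
from ibis import _
from IPython.display import display  # already a builtin inside Jupyter

csvs = list(Path("data/github_2022/").glob("*.csv"))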

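def duped_rows(t):
    """Return rows that match at least one other row on every column except "votes".

    The added column `n` is the size of each such group.
    """
    return (
        # Group on every column except "votes".
        t.group_by(
            [c for c in t.columns if c != "votes"],
        )
        # Grouped mutate: attach the group size to each row.
        .mutate(
            n=_.count(),
        )
        .order_by(
            _.n.desc(),
        )
        # Keep only rows whose group has more than one member.
        .filter(_.n > 1)
    )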

dupe_states = []
for csv in csvs:
    t = ibis.read_csv(csv, all_varchar=True)
    duped = duped_rows(t)
    n_dupe_rows = duped.count().execute()
    if n_dupe_rows > 0:
        dupe_states.append((csv, n_dupe_rows))
        print(csv)
        display(duped)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ precinct                    ┃ office           ┃ party_detailed ┃ party_simplified ┃ mode   ┃ votes  ┃ county_name ┃ county_fips ┃ jurisdiction_name ┃ jurisdiction_fips ┃ candidate   ┃ district ┃ dataverse ┃ year   ┃ stage  ┃ state    ┃ special ┃ writein ┃ state_po ┃ state_fips ┃ state_cen ┃ state_ic ┃ date       ┃ readme_check ┃ magnitude ┃ n     ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ string                      │ string           │ string         │ string           │ string │ string │ string      │ string      │ string            │ string            │ string      │ string   │ string    │ string │ string │ string   │ string  │ string  │ string   │ string     │ string    │ string   │ string     │ string       │ string    │ int64 │
├─────────────────────────────┼──────────────────┼────────────────┼──────────────────┼────────┼────────┼─────────────┼─────────────┼───────────────────┼───────────────────┼─────────────┼──────────┼───────────┼────────┼────────┼──────────┼─────────┼─────────┼──────────┼────────────┼───────────┼──────────┼────────────┼──────────────┼───────────┼───────┤
│ Charter Township of Canton, │ ATTORNEY GENERAL │ DEMOCRAT       │ DEMOCRAT         │ TOTAL  │ 589    │ WAYNE       │ 26163       │ WAYNE             │ 26163             │ DANA NESSEL │ NULL     │ STATE     │ 2022   │ GEN    │ MICHIGAN │ FALSE   │ FALSE   │ MI       │ 26         │ 34        │ 23       │ 2022-11-08 │ TRUE         │ 1         │   496 │
│ Charter Township of Canton, │ ATTORNEY GENERAL │ DEMOCRAT       │ DEMOCRAT         │ TOTAL  │ 1016   │ WAYNE       │ 26163       │ WAYNE             │ 26163             │ DANA NESSEL │ NULL     │ STATE     │ 2022   │ GEN    │ MICHIGAN │ FALSE   │ FALSE   │ MI       │ 26         │ 34        │ 23       │ 2022-11-08 │ TRUE         │ 1         │   496 │
│ Charter Township of Canton, │ ATTORNEY GENERAL │ DEMOCRAT       │ DEMOCRAT         │ TOTAL  │ 789    │ WAYNE       │ 26163       │ WAYNE             │ 26163             │ DANA NESSEL │ NULL     │ STATE     │ 2022   │ GEN    │ MICHIGAN │ FALSE   │ FALSE   │ MI       │ 26         │ 34        │ 23       │ 2022-11-08 │ TRUE         │ 1         │   496 │
│ Charter Township of Canton, │ ATTORNEY GENERAL │ DEMOCRAT       │ DEMOCRAT         │ TOTAL  │ 937    │ WAYNE       │ 26163       │ WAYNE             │ 26163             │ DANA NESSEL │ NULL     │ STATE     │ 2022   │ GEN    │ MICHIGAN │ FALSE   │ FALSE   │ MI       │ 26         │ 34        │ 23       │ 2022-11-08 │ TRUE         │ 1         │   496 │
│ Charter Township of Canton, │ ATTORNEY GENERAL │ DEMOCRAT       │ DEMOCRAT         │ TOTAL  │ 730    │ WAYNE       │ 26163       │ WAYNE             │ 26163             │ DANA NESSEL │ NULL     │ STATE     │ 2022   │ GEN    │ MICHIGAN │ FALSE   │ FALSE   │ MI       │ 26         │ 34        │ 23       │ 2022-11-08 │ TRUE         │ 1         │   496 │
│ Charter Township of Canton, │ ATTORNEY GENERAL │ DEMOCRAT       │ DEMOCRAT         │ TOTAL  │ 926    │ WAYNE       │ 26163       │ WAYNE             │ 26163             │ DANA NESSEL │ NULL     │ STATE     │ 2022   │ GEN    │ MICHIGAN │ FALSE   │ FALSE   │ MI       │ 26         │ 34        │ 23       │ 2022-11-08 │ TRUE         │ 1         │   496 │
│ Charter Township of Canton, │ ATTORNEY GENERAL │ DEMOCRAT       │ DEMOCRAT         │ TOTAL  │ 464    │ WAYNE       │ 26163       │ WAYNE             │ 26163             │ DANA NESSEL │ NULL     │ STATE     │ 2022   │ GEN    │ MICHIGAN │ FALSE   │ FALSE   │ MI       │ 26         │ 34        │ 23       │ 2022-11-08 │ TRUE         │ 1         │   496 │
│ Charter Township of Canton, │ ATTORNEY GENERAL │ DEMOCRAT       │ DEMOCRAT         │ TOTAL  │ 1014   │ WAYNE       │ 26163       │ WAYNE             │ 26163             │ DANA NESSEL │ NULL     │ STATE     │ 2022   │ GEN    │ MICHIGAN │ FALSE   │ FALSE   │ MI       │ 26         │ 34        │ 23       │ 2022-11-08 │ TRUE         │ 1         │   496 │
│ Charter Township of Canton, │ ATTORNEY GENERAL │ DEMOCRAT       │ DEMOCRAT         │ TOTAL  │ 616    │ WAYNE       │ 26163       │ WAYNE             │ 26163             │ DANA NESSEL │ NULL     │ STATE     │ 2022   │ GEN    │ MICHIGAN │ FALSE   │ FALSE   │ MI       │ 26         │ 34        │ 23       │ 2022-11-08 │ TRUE         │ 1         │   496 │
│ Charter Township of Canton, │ ATTORNEY GENERAL │ DEMOCRAT       │ DEMOCRAT         │ TOTAL  │ 866    │ WAYNE       │ 26163       │ WAYNE             │ 26163             │ DANA NESSEL │ NULL     │ STATE     │ 2022   │ GEN    │ MICHIGAN │ FALSE   │ FALSE   │ MI       │ 26         │ 34        │ 23       │ 2022-11-08 │ TRUE         │ 1         │   496 │
│ …                           │ …                │ …              │ …                │ …      │ …      │ …           │ …           │ …                 │ …                 │ …           │ …        │ …         │ …      │ …      │ …        │ …       │ …       │ …        │ …          │ …         │ …        │ …          │ …            │ …         │     … │
└─────────────────────────────┴──────────────────┴────────────────┴──────────────────┴────────┴────────┴─────────────┴─────────────┴───────────────────┴───────────────────┴─────────────┴──────────┴───────────┴────────┴────────┴──────────┴─────────┴─────────┴──────────┴────────────┴───────────┴──────────┴────────────┴──────────────┴───────────┴───────┘

This is a problem for a variety of states:

[(PosixPath('data/github_2022/south_carolina_cleaned.csv'), 38493),
 (PosixPath('data/github_2022/mi22_cleaned.csv'), 36230),
 (PosixPath('data/github_2022/nd22_cleaned.csv'), 799),
 (PosixPath('data/github_2022/id22_cleaned.csv'), 538),
 (PosixPath('data/github_2022/wa22_cleaned.csv'), 15990),
 (PosixPath('data/github_2022/ms_cleaned22.csv'), 4),
 (PosixPath('data/github_2022/va22_cleaned.csv'), 2),
 (PosixPath('data/github_2022/ny22_cleaned.csv'), 8),
 (PosixPath('data/github_2022/me_final.csv'), 21),
 (PosixPath('data/github_2022/RI-cleaned2.csv'), 22),
 (PosixPath('data/github_2022/ct22_cleaned.csv'), 3290),
 (PosixPath('data/github_2022/AZ-cleaned.csv'), 8)]
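
The counts above include any group with more than one row, so exact duplicates and rows that differ only in "votes" are counted together. As a rough sketch of how the two cases could be told apart, here is a standard-library version of the same grouping. The sample rows and the helper name `near_duplicate_groups` are made up for illustration and are not taken from any of the state files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical toy data, NOT real rows from the repository.
SAMPLE_CSV = """precinct,office,candidate,votes
P1,GOVERNOR,A,10
P1,GOVERNOR,A,25
P1,GOVERNOR,B,7
P2,GOVERNOR,A,10
P2,GOVERNOR,A,10
"""


def near_duplicate_groups(rows, ignore="votes"):
    """Group rows on every column except `ignore`.

    Returns a dict mapping each group key to the list of `ignore` values,
    keeping only groups that contain more than one row.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple((col, val) for col, val in row.items() if col != ignore)
        groups[key].append(row[ignore])
    return {key: vals for key, vals in groups.items() if len(vals) > 1}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
dupes = near_duplicate_groups(rows)

for key, votes in dupes.items():
    # One distinct vote value means the rows are exact duplicates.
    kind = "exact duplicate" if len(set(votes)) == 1 else "differs only in votes"
    print(dict(key), votes, kind)
```

Splitting the groups this way makes it easier to say which rows look like repeated records and which carry conflicting vote totals for the same key.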

sbaltzmit commented 8 months ago

@NickCrews first let me reiterate, as I've said several times today, that I really appreciate your energy in pointing these issues out to us. It really has been very helpful. Please, though, when you open an issue in this repository, make sure to explain why it's an issue, and make sure that it's not addressed in the readme. We check for near-duplicates in every state and it's an issue that we work hard to address and that we discuss in the readme when it's not resolvable. For most if not all of those states, the readme explains the cause of the near-duplicates. The Michigan section specifically says "There are substantial rows that are duplicated, or duplicated up to vote totals", and then explains why. Code snippets that don't engage with why something is the case or how it could be improved are not a sufficient Issue for me to act on. As a ground rule of respectful engagement with me, before opening another issue, please pause and make sure that the concern you're raising is a problem that you can explain and that you are sure has not been addressed in the readme.