MEDSL / 2020-elections-official


Inconsistent coding of "district" for single-member House states #4

Closed tdsmith closed 2 years ago

tdsmith commented 2 years ago

The district column is coded inconsistently for states that elect a single US House member.

In particular, Vermont reports state House districts:

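import pandas as pd

# `medsl_path` (a path to the local data checkout) and `medsl_dtypes` are defined elsewhere in the session
house = pd.read_csv(medsl_path / "HOUSE/HOUSE_precinct_general.zip", dtype=medsl_dtypes)

# Identify rows where `district` neither contains three consecutive digits nor equals "STATEWIDE", and count them by state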
house.loc[~house.district.str.contains(r"\d{3}").astype(bool) & (house.district != "STATEWIDE"), "state"].value_counts()
# VERMONT    3167
# Name: state, dtype: int64

# All Vermont rows are affected
house.loc[house.state == "VERMONT",].shape
# (3167, 25)

# Examples
house.loc[house.state == "VERMONT", "district"].value_counts().head()
# ESX-CAL-ORL    86
# ESX-CAL        81
# ORL-CAL        79
# GI-CHI         67
# ADD-2          63
# Name: district, dtype: int64

And other jurisdictions use a mixture of 000, STATEWIDE, and missingness:

districts = house[["state", "district"]].drop_duplicates()
district_count = districts.groupby("state")["district"].size().reset_index(name="size")
districts.merge(district_count, how="inner", on="state").query("size == 1")
state                 district   size
ALASKA                000        1
DELAWARE              000        1
DISTRICT OF COLUMBIA  NaN        1
MONTANA               000        1
NORTH DAKOTA          000        1
SOUTH DAKOTA          STATEWIDE  1
WYOMING               000        1

I think 000 would be the least surprising value for these.
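
A minimal sketch of that recoding, assuming the `state` and `district` column names used above. The helper name `normalize_at_large` and the toy frame are illustrative only and are not part of the MEDSL data or tooling. Vermont is deliberately left out, since its rows carry state House district codes rather than an alternate at-large label.

```python
import pandas as pd

# States from the table above whose House rows carry a single district value.
# Vermont is excluded on purpose: its `district` values are state House codes.
AT_LARGE_STATES = {
    "ALASKA", "DELAWARE", "DISTRICT OF COLUMBIA", "MONTANA",
    "NORTH DAKOTA", "SOUTH DAKOTA", "WYOMING",
}

def normalize_at_large(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `df` with `district` set to "000" for at-large states."""
    out = df.copy()
    out.loc[out["state"].isin(AT_LARGE_STATES), "district"] = "000"
    return out

# Toy rows standing in for the real precinct file (hypothetical data).
toy = pd.DataFrame({
    "state": ["ALASKA", "SOUTH DAKOTA", "DISTRICT OF COLUMBIA", "OHIO"],
    "district": ["000", "STATEWIDE", None, "003"],
})
fixed = normalize_at_large(toy)
print(fixed)
```

Working on a copy keeps the original frame available for comparison, so the recoded rows can be diffed against the raw values.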

cstewartiii commented 2 years ago

Thanks for drawing this to our attention. There are two issues: (1) the nonstandard treatment of at-large districts and (2) the special issues with Vermont. I won’t bore you with the Vermont issue, but we have that under control and are in the process of correcting it.

Best,

-cs


Charles Stewart III
Kenan Sahin Distinguished Professor of Political Science
Director, MIT Election Data and Science Lab
Co-Director, Caltech/MIT Voting Technology Project

Department of Political Science
The Massachusetts Institute of Technology
Cambridge, Massachusetts 02139
617-253-3127


declanc2021 commented 2 years ago

Hi @tdsmith, thanks a lot for raising this issue.

(1) The at-large districts have been standardized to "000". (2) The Vermont issue has also been fixed: Vermont's US House district is now coded "000" as well, and the information previously stored in the district field is now nested within the precinct field. A more detailed explanation can be found in our README.md.

I am closing this issue now.
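
For readers who want to confirm the standardization on their own copy of the file, a check along these lines may help. It is a sketch that assumes the `state` and `district` column names from the issue. The function name `nonstandard_districts` and the toy frame are hypothetical, and a legitimate value such as `STATEWIDE` would need to be exempted if it is still expected in the data.

```python
import pandas as pd

def nonstandard_districts(df: pd.DataFrame) -> pd.DataFrame:
    """Return distinct (state, district) pairs whose district is not exactly three digits."""
    # fullmatch requires the whole string to be three digits;
    # na=False treats missing values as non-matching, so they are reported too.
    ok = df["district"].str.fullmatch(r"\d{3}", na=False)
    return df.loc[~ok, ["state", "district"]].drop_duplicates()

# Toy rows standing in for the real precinct file (hypothetical data).
toy = pd.DataFrame({
    "state": ["VERMONT", "VERMONT", "WYOMING", "SOUTH DAKOTA"],
    "district": ["000", "ADD-2", "000", None],
})
bad = nonstandard_districts(toy)
print(bad)
```

An empty result on the corrected file would indicate that every row uses a three-digit district code.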

tdsmith commented 2 years ago

Thank you!