Closed jeturcotte closed 2 years ago
Looks like it. You should probably go to the original sources just to check it out.
Ugh.
-cs
Charles Stewart III
Kenan Sahin Distinguished Professor of Political Science
The Massachusetts Institute of Technology
Cambridge, Massachusetts 02139
617-253-3127
From: Joshua Eric Turcotte
Sent: Tuesday, July 13, 2021 8:59 AM
To: MEDSL/2018-elections-official
Cc: Subscribed
Subject: [MEDSL/2018-elections-official] Did RCV mess with the data collection for Maine in 2018? (#14)
So, I took your data for the 2018 elections and pulled out the votes for US Representative, aggregated the tallies first by district (summing across candidates) and then by county (just to be safe), and paired this with 2018 population estimates by county so that I could study whether there is a trend in population density vs. voting tallies.
I noticed a problem (in my home state, no less).
     county       state pop    sq.mi   pop.density vote  total  vpp
69   Androscoggin ME    107914  467.93 230.61996   total 230655 2.137396
98   Aroostook    ME     67318 6671.33  10.09064   total 116305 1.727695
1373 Franklin     ME     29915 1696.61  17.63222   total  68260 2.281798
1668 Hancock      ME     54734 1586.89  34.49136   total 129620 2.368181
2092 Kennebec     ME    122044  867.52 140.68148   total 153928 1.261250
3005 Oxford       ME     57754 2076.84  27.80859   total 149480 2.588219
3065 Penobscot    ME    151817 3397.36  44.68676   total 300850 1.981662
3151 Piscataquis  ME     16746 3960.86   4.22787   total  37575 2.243819
3611 Somerset     ME     50489 3924.40  12.86541   total 105485 2.089267
3980 Waldo        ME     39657  729.92  54.33061   total  98225 2.476864
4056 Washington   ME     31321 2562.66  12.22207   total  70935 2.264774
vpp is votes per person.
Here's my work (I could easily be to blame, but...)
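Since a person can cast at most one ballot in a single contest, a votes-per-person ratio above 1 cannot be right, which is what makes these rows stand out. A minimal sketch of that sanity check, using three rows copied from the table above (the `suspect` name is only illustrative):

```r
# Three rows copied from the table above
# (pop = 2018 population estimate, total = summed US Representative votes).
stats <- data.frame(
  county = c("Androscoggin", "Aroostook", "Kennebec"),
  pop    = c(107914, 67318, 122044),
  total  = c(230655, 116305, 153928)
)

# Votes per person; anything above 1 means more votes than residents.
stats$vpp <- stats$total / stats$pop

# Keep only the rows that fail the plausibility check.
suspect <- stats[stats$vpp > 1, ]
print(suspect)
```

All three rows are flagged, which points at a counting or merging problem rather than at real turnout.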
`%not in%` <- Negate(`%in%`)
library(tidyr)  # provides separate() and the %>% pipe used below
votes <- read.csv('county_2018.csv')
population <- read.csv('county.population.estimate.csv')
land.area <- read.csv('land.area.csv')
# strip the leading '.' from the row names (an unescaped '.' would match any first character)
population$county <- gsub(' County','',sub('^\\.','', rownames(population)))
population$X2018 <- as.numeric(gsub(',','',population$X2018))
population <- population %>% separate( county, c('county','state'), sep=', ' )
population <- population[ ,c( 'county', 'state', 'X2018' ) ]
population$state <- state.abb[match(population$state, state.name)]
colnames(population) <- c('county','state','population')
population[ is.na(population$state), 'state' ] <- 'DC'
rownames(population) <- tolower(paste0(population$county,' ',population$state))
land.area <- land.area[ grep(',',land.area$area), ]
land.area <- land.area[ rownames(land.area) %not in% c(2966,2976,2978,2996,2997), ]
rownames(land.area) <- tolower(gsub(',','',land.area$area))
stats <- merge(land.area, population, by=0, all=FALSE)
stats <- stats[ ,c('Row.names','county','state','population','sq.mi') ]
colnames(stats) <- c('mergename','county','state','population','sq.mi')
stats$density <- stats$population / stats$sq.mi
repvote <- votes[ votes$office == 'US Representative', ]
repvote$mergename <- tolower(paste0(repvote$county,' ',repvote$state_po))
repvote <- aggregate( candidatevotes ~ mergename + mode + district, repvote, FUN=sum )
repvote <- aggregate( candidatevotes ~ mergename + mode, repvote, FUN=sum )
stats <- merge(stats,repvote,by='mergename')
stats <- stats[ ,c('county','state','population','sq.mi','density','mode','candidatevotes') ]
colnames(stats) <- c('county','state','pop','sq.mi','pop.density', 'vote', 'total')
stats[stats$vote %in% c('absentee','absentee by mail','absentee mail','mail ballots'), 'vote'] <- 'mail'
stats[stats$vote %in% c('absentee/early vote','advance in person','early','early vote'), 'vote'] <- 'early'
stats[stats$vote %in% c('election','election day','electon day','machine','one stop'), 'vote'] <- 'in person'
stats <- stats[ stats$total != 0, ]
stats <- stats[ stats$pop.density != 0, ]
stats$vpp <- stats$total / stats$pop
no.total <- stats[ stats$vote != 'total', ]
only.total <- stats[ stats$vote == 'total', ]
library(ggplot2)
library(scales)  # for log2_trans()
ggplot(only.total, aes(x=pop.density, y=vpp, color=vote)) +
  geom_point(shape=23, alpha=0.33) +
  geom_smooth() +
  theme_classic() +
  scale_x_log10() +
  scale_y_continuous(trans = log2_trans()) +
  ggtitle("Voting Habits by County Population Density\n(2018 Midterm Elections)") +
  xlab("Population per Square Mile") +
  ylab("Votes Cast Per Person") +
  geom_vline(xintercept=1000, alpha=0.25) +
  geom_vline(xintercept=500, alpha=0.25) +
  # note: a negative y has no position on a log2 axis, so this source label is likely dropped with a warning
  annotate("text", x = 120, y = -0.065, label = "(pop/area): U.S. Census\n(votes): MIT Election Data and Science Lab") +
  annotate("text", x=1100, y=0.4, label='urban line', angle=90, alpha=0.25) +
  annotate("text", x=550, y=0.4, label='rural line', angle=90, alpha=0.25)
I don't see the same problem with anything other than mode==total, by the way.
Now, I'm pretty sure that Maine (ME) had started using ranked choice voting by then, so I have to wonder whether those who collected the data for this project missed that and effectively counted the first- and second-choice candidates as multiple votes instead of as one ballot cast.
If you see a dumb error on my part, feel free to point it out.
We've been cleaning and reorganizing these files and I wanted to make sure that we haven't retained this problem, so I'd like to ask a few points of clarification.
One initial observation is that it can't be caused by instant runoff voting reallocations, because our data do not include US House district 2 (we're working to include it, but some work needs to be done to turn the ballot-level data the state provides into precinct-level returns of how much each ballot contributed to the ultimate results). District 1 was decided by a majority so no reallocations needed to occur.
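The majority argument can be sketched as follows. This is only an illustration of the rule that instant runoff reallocation starts when no candidate holds more than half of the first-choice votes; the helper name `needs_reallocation` and the tallies are made up for the example and are not taken from our files.

```r
# Hypothetical helper: TRUE when no candidate has a strict majority of
# first-choice votes, i.e. when instant runoff rounds would be needed.
needs_reallocation <- function(first_choice) {
  max(first_choice) <= sum(first_choice) / 2
}

# Invented first-choice tallies, for illustration only.
majority_race  <- c(a = 60, b = 30, c = 10)  # leader has 60 of 100
plurality_race <- c(a = 45, b = 40, c = 15)  # leader has 45 of 100

print(needs_reallocation(majority_race))   # FALSE: decided on first choices
print(needs_reallocation(plurality_race))  # TRUE: lower choices come into play
```

In a race like the first one, second and later choices never enter the count, so they cannot inflate the totals.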
Now, maybe second-place votes were simply binned as first-place votes, but our vote totals match the official ones: the sum of votes for US House across all candidates and parties in our file is 349963, which exactly matches the total reported by the state (https://www.maine.gov/sos/cec/elec/results/index.html). To make sure the state didn't somehow bin multiple choices on a ballot into multiple votes for a candidate, I also compared the 2018 numbers by county with the 2014 county totals. They look extremely similar to the reported vote totals for the last plurality midterm election in district 1, so I don't think a mistake like that occurred either on our end or on the state's.
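A check of this kind can be sketched in R along these lines. The column names (`state_po`, `office`, `county`, `candidatevotes`) follow the script in the original post; the rows, the numbers, and the `house_total` helper are invented for illustration, and `official_total` is a placeholder for the figure published by the state.

```r
# Toy stand-in for the returns file; every number here is invented.
votes <- data.frame(
  state_po       = c("ME", "ME", "ME", "ME", "NH"),
  office         = c("US Representative", "US Representative",
                     "US Representative", "Governor", "US Representative"),
  county         = c("County A", "County A", "County B", "County A", "County C"),
  candidatevotes = c(100, 80, 150, 175, 60),
  stringsAsFactors = FALSE
)

# Hypothetical helper: statewide US House total across all candidates.
house_total <- function(df, state) {
  rows <- df$office == "US Representative" & df$state_po == state
  sum(df$candidatevotes[rows])
}

me_total <- house_total(votes, "ME")
official_total <- 330  # placeholder; use the state's published total with real data
print(me_total == official_total)

# County-level totals, for comparison against an earlier election.
me_house  <- votes[votes$office == "US Representative" & votes$state_po == "ME", ]
by_county <- aggregate(candidatevotes ~ county, me_house, FUN = sum)
print(by_county)
```

If the statewide sum matches but a county ratio still looks wrong, the problem is more likely in the population merge than in the vote counts.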
So here are a few questions to track down your issue. First, if you run the same code on the state's county totals, you should see the same problem, since they're basically the same numbers as ours, so that's worth checking. Second, I wonder how the situation looks if you use official turnout numbers rather than estimating total county population. Third, are you binning in substantial numbers of undervotes or overvotes?
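For the third question, a rough way to see how much undervotes and overvotes contribute is to sum the tallies with and without them. The sketch below assumes a `candidate` column in which such rows are labeled "undervotes" and "overvotes"; both the column and the labels are assumptions, and the rows are invented, so adjust them to whatever the file actually uses.

```r
# Invented rows for one county; the candidate labels are assumptions.
repvote <- data.frame(
  mergename      = rep("county a me", 4),
  candidate      = c("Candidate A", "Candidate B", "undervotes", "overvotes"),
  candidatevotes = c(120, 90, 15, 5),
  stringsAsFactors = FALSE
)

# Labels treated as non-votes (assumed spelling).
non_votes <- c("undervotes", "overvotes")
is_valid  <- !(tolower(repvote$candidate) %in% non_votes)

# Totals with everything included versus valid votes only.
with_all   <- aggregate(candidatevotes ~ mergename, repvote, FUN = sum)
valid_only <- aggregate(candidatevotes ~ mergename, repvote[is_valid, ], FUN = sum)

# Ratio above 1 shows how much the non-vote rows inflate the tally.
inflation <- with_all$candidatevotes / valid_only$candidatevotes
print(inflation)
```

A ratio near 1 would rule undervotes and overvotes out as the cause of a votes-per-person figure around 2.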
Closing the issue for now. Please follow up if further exploration turns up a problem.