jeturcotte commented 3 years ago

So, I took your data for the 2018 elections and pulled out the votes for US Representative, aggregated the tallies first by district (ignoring candidates) and then by county (just to be safe), and paired this with 2018 county population estimates so that I could study whether there is a trend in population density vs. vote tallies.

I noticed a problem (in my home state, no less.)

           county state    pop   sq.mi pop.density  vote  total      vpp
69   Androscoggin    ME 107914  467.93   230.61996 total 230655 2.137396
98      Aroostook    ME  67318 6671.33    10.09064 total 116305 1.727695
1373     Franklin    ME  29915 1696.61    17.63222 total  68260 2.281798
1668      Hancock    ME  54734 1586.89    34.49136 total 129620 2.368181
2092     Kennebec    ME 122044  867.52   140.68148 total 153928 1.261250
3005       Oxford    ME  57754 2076.84    27.80859 total 149480 2.588219
3065    Penobscot    ME 151817 3397.36    44.68676 total 300850 1.981662
3151  Piscataquis    ME  16746 3960.86     4.22787 total  37575 2.243819
3611     Somerset    ME  50489 3924.40    12.86541 total 105485 2.089267
3980        Waldo    ME  39657  729.92    54.33061 total  98225 2.476864
4056   Washington    ME  31321 2562.66    12.22207 total  70935 2.264774

vpp is votes per person (total / pop).
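As a quick sanity check on that ratio, here is a minimal sketch that recomputes vpp for three of the rows above and flags any county where it exceeds 1, i.e. more votes for one office than residents. The data frame name `me` and the `suspicious` column are made up for illustration; the numbers are copied from the table.

```r
# Three rows copied from the table above; `me` is a hypothetical name.
me <- data.frame(
  county = c("Androscoggin", "Kennebec", "Oxford"),
  pop    = c(107914, 122044, 57754),
  total  = c(230655, 153928, 149480)
)

# Votes per person: votes for the office divided by county population.
me$vpp <- me$total / me$pop

# One office, one vote per ballot: a ratio above 1 cannot be real turnout.
me$suspicious <- me$vpp > 1

print(me)
```

All three rows come out flagged, which is what prompted the question.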

Here's my work (I could easily be the one to blame):


library(dplyr)   # provides %>% and distinct()
library(tidyr)   # provides separate()

`%not in%` <- Negate(`%in%`)

votes <- read.csv('county_2018.csv')
population <- read.csv('county.population.estimate.csv')
land.area <- read.csv('land.area.csv')

population$county <- gsub(' County','',sub('.','', rownames(population)))
population$X2018 <- as.numeric(gsub(',','',population$X2018))
population <- population %>% separate( county, c('county','state'), sep=', ' )
population <- population[ ,c( 'county', 'state', 'X2018' ) ]
population$state <- state.abb[match(population$state, state.name)]
colnames(population) <- c('county','state','population')
population[ is.na(population$state), 'state' ] <- 'DC'
rownames(population) <- tolower(paste0(population$county,' ',population$state))

land.area <- land.area[ grep(',',land.area$area), ]
land.area <- land.area[ rownames(land.area) %not in% c(2966,2976,2978,2996,2997), ]
rownames(land.area) <- tolower(gsub(',','',land.area$area))

stats <- merge(land.area, population, by=0, all=FALSE) 
stats <- stats[ ,c('Row.names','county','state','population','sq.mi') ]
colnames(stats) <- c('mergename','county','state','population','sq.mi')
stats$density <- stats$population / stats$sq.mi

repvote <- votes[ votes$office == 'US Representative', ]
repvote$mergename <- tolower(paste0(repvote$county,' ',repvote$state_po))
#repvote <- repvote[ ,c('mergename','mode','candidatevotes') ]
#repvote <- repvote %>% distinct()
  # don't care about individual candidates or party right now, just turnout
  # but do need to temporarily protect district counts so they dont get misapplied
repvote <- aggregate( candidatevotes ~ mergename + mode + district, repvote, FUN=sum )
  # but now we DONT need to, since what we want is by-county, not by-district
repvote <- aggregate( candidatevotes ~ mergename + mode, repvote, FUN=sum )

stats <- merge(stats,repvote,by='mergename')
stats <- stats[ ,c('county','state','population','sq.mi','density','mode','candidatevotes') ]
colnames(stats) <- c('county','state','pop','sq.mi','pop.density', 'vote', 'total')

stats[stats$vote %in% c('absentee','absentee by mail','absentee mail','mail ballots'), 'vote'] <- 'mail'
stats[stats$vote %in% c('absentee/early vote','advance in person','early','early vote'), 'vote'] <- 'early'
stats[stats$vote %in% c('election','election day','electon day','machine','one stop'), 'vote'] <- 'in person'
stats <- stats[ stats$total != 0, ]
stats <- stats[ stats$pop.density != 0, ]
stats$vpp <- stats$total / stats$pop

no.total <- stats[ stats$vote != 'total', ]
only.total <- stats[ stats$vote == 'total', ]

library(ggplot2)
library(scales)

ggplot(only.total, aes(x=pop.density, y=vpp, color=vote)) +
  geom_point(shape=23, alpha=0.33) +
  geom_smooth() +
  theme_classic() +
  scale_x_log10() +
  scale_y_continuous(trans = log2_trans()) +
  ggtitle("Voting Habits by County Population Density\n(2018 Midterm Elections)") +
  xlab("Population per Square Mile") +
  ylab("Votes Cast Per Person") +
  geom_vline(xintercept=1000, alpha=0.25) +
  geom_vline(xintercept=500, alpha=0.25) +
  annotate("text", x = 120, y = -0.065, label = "(pop/area): U.S. Census\n(votes): MIT Election Data and Science Lab") +
  annotate("text", x=1100, y=0.4, label='urban line', angle=90, alpha=0.25) +
  annotate("text", x=550, y=0.4, label='rural line', angle=90, alpha=0.25)

I don't see the same problem with anything other than mode==total, by the way.

Now, I'm pretty sure that Maine (ME) had started using ranked choice voting by then, so I have to wonder whether those who collected the data for this project missed that and effectively counted the first- and second-choice candidates as multiple votes instead of as one ballot cast.

If you see a dumb error on my part, feel free to point it out.
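One check that might separate repeated rows from a ranked-choice artifact is to look for rows that share the same county, district, mode and candidate before aggregating. This is only a sketch on made-up toy data: the `candidate` column name is an assumption about the file, and `toy`, `key`, `dup`, `with_dups` and `no_dups` are hypothetical names.

```r
# Toy rows standing in for the county file; the values are invented.
toy <- data.frame(
  county         = c("Oxford", "Oxford", "Oxford", "Waldo"),
  district       = c(2, 2, 2, 2),
  mode           = "total",
  candidate      = c("A", "A", "B", "A"),   # assumed column name
  candidatevotes = c(100, 100, 80, 50),
  stringsAsFactors = FALSE
)

# A row is a repeat if another row has the same identifying columns.
key <- toy[, c("county", "district", "mode", "candidate")]
dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
print(toy[dup, ])

# County sums with the repeats left in, and with only the first copy kept.
with_dups <- aggregate(candidatevotes ~ county + mode, toy, FUN = sum)
no_dups   <- aggregate(candidatevotes ~ county + mode,
                       toy[!duplicated(key), ], FUN = sum)
print(with_dups)
print(no_dups)
```

If the two aggregates differ for the Maine counties in the real file, repeated rows would be a simpler explanation than ranked choice.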

cstewartiii commented 3 years ago

Looks like it. You should probably go to the original sources just to check it out.

Ugh.

-cs

Charles Stewart III, Kenan Sahin Distinguished Professor of Political Science, The Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, 617-253-3127

From: Joshua Eric Turcotte
Sent: Tuesday, July 13, 2021 8:59 AM
To: MEDSL/2018-elections-official
Subject: [MEDSL/2018-elections-official] Did RCV mess with the data collection for Maine in 2018? (#14)


sbaltzmit commented 2 years ago

We've been cleaning and reorganizing these files and I wanted to make sure that we haven't retained this problem, so I'd like to ask a few points of clarification.

One initial observation is that it can't be caused by instant runoff voting reallocations, because our data do not include US House district 2 (we're working to include it, but some work needs to be done to turn the ballot-level data the state provides into precinct-level returns of how much each ballot contributed to the ultimate results). District 1 was decided by a majority so no reallocations needed to occur.

Now, maybe second-place votes were simply binned as first-place votes, but our vote totals match the official ones: the sum of votes for US House across all candidates and parties in our file is 349963, which exactly matches the total reported by the state (https://www.maine.gov/sos/cec/elec/results/index.html). To make sure the state didn't somehow turn multiple choices on one ballot into multiple votes for a candidate, I compared the 2018 county totals with the 2014 county totals, the last plurality midterm election in district 1. The two look extremely similar, so I don't think a mistake like that occurred either on our end or on the state's.

So here are a few questions to track down your issue. First, if you run the same code on the state's county totals, you should see the same problem, since they're basically the same numbers as ours, so that's worth checking. Second, I wonder how the situation looks if you use official turnout numbers rather than estimating total county population. Third, are you binning in substantial numbers of undervotes or overvotes?
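The first check can be sketched roughly as follows: sum `candidatevotes` for one state and office and compare the result with an officially reported total. The helper `check_state_total` and the toy data are hypothetical; the column names `office`, `state_po` and `candidatevotes` are taken from the code in the original post.

```r
# Hypothetical helper: compare a file's vote total for one state and office
# against an officially reported figure.
check_state_total <- function(votes, state, official_total,
                              office = "US Representative") {
  rows <- votes[votes$office == office & votes$state_po == state, ]
  file_total <- sum(rows$candidatevotes, na.rm = TRUE)
  list(file_total     = file_total,
       official_total = official_total,
       matches        = file_total == official_total)
}

# Toy data with invented numbers, only to show the call.
toy <- data.frame(
  state_po       = c("ME", "ME", "ME", "NH"),
  office         = c("US Representative", "US Representative",
                     "Governor", "US Representative"),
  candidatevotes = c(200, 150, 400, 90),
  stringsAsFactors = FALSE
)

res <- check_state_total(toy, "ME", official_total = 350)
print(res$matches)
```

On the real file the call would be something like `check_state_total(read.csv('county_2018.csv'), 'ME', 349963)`, using the total quoted above. If the file carries both `total` rows and per-mode rows for the same county, the sum would need to be restricted to one of them first.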

sbaltzmit commented 2 years ago

Closing this issue for now. Please follow up if further exploration turns up a problem.

