extract district from contest name

bill10 commented 7 years ago

If a contest name has district as part of it, extract the district and make it a separate column.

bill10 commented 7 years ago

It turned out not to be that easy. I found two messy cases.

"DISTRICT" appears multiple times, e.g., "NC DISTRICT COURT JUDGE DISTRICT 15A - Lambeth Seat". Solution: take the last "DISTRICT".
One more question: should we keep "Lambeth" in the contest name? I think he is the candidate, isn't he?
"DISTRICT" means something else, e.g., "YANCEY COUNTY SOIL AND WATER CONSERVATION DISTRICT SUPERVISOR". Solution: if "DISTRICT" is not followed by a number, ignore it.
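A minimal sketch of those two rules, assuming "a number" means digits with an optional trailing letter (as in 15A). The function name `split_on_last_district` and the three-part return value are hypothetical, not taken from the ingestor.

```python
import re

# Assumption: a district number is digits plus an optional letter suffix, e.g. "15A".
DISTRICT_NUM = re.compile(r"\bDISTRICT\s+(\d+[A-Z]?)\b")

def split_on_last_district(contest_name):
    """Return (office, district, remainder) using the last numbered DISTRICT."""
    matches = list(DISTRICT_NUM.finditer(contest_name))
    if not matches:
        # "DISTRICT" is absent or not followed by a number: leave the name alone.
        return contest_name.strip(), None, ""
    last = matches[-1]
    office = contest_name[:last.start()].strip()
    remainder = contest_name[last.end():].strip()
    return office, last.group(1), remainder
```

The remainder (for example "- Lambeth Seat") is returned separately rather than dropped, so the question of whether to keep it can be decided later.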

rtburg commented 7 years ago

Looking in the election results for the Nov. 8, 2011 municipal elections, I find these variants of "DISTRICT"

A) TOWN OF LEWISTON WOODVILLE TOWN COUNCIL LEWISTON DISTRICT (Where "DISTRICT" is a suffix. It follows a word that uniquely describes the office for which the candidate is running.)

B) TOWN OF NAVASSA COMMISSIONER DISTRICT 1 (Where "DISTRICT" is a prefix. It immediately precedes one or more characters that uniquely identifies the office for which the candidate is running. Note that sometimes the numeral following "DISTRICT" is in Roman format, such as "TOWN OF BERMUDA RUN COUNCILMAN DISTRICT III" and other times the unique identifier is a letter, such as "TOWN OF ENFIELD TOWN COMMISSIONER DISTRICT A")

C) BRUNSWICK COUNTY - SOUTHEAST BRUNSWICK SANITARY DISTRICT COMMISSIONER (Where "DISTRICT" is an integral part of the office name. Removing it from the office name or using it to split the office name would not be appropriate.)

In the results file for the Nov. 8, 2016 election I find these variants...

A) NC DISTRICT COURT JUDGE DISTRICT 25 (CHERRY) (This is where one unique district has multiple seats in it. The N.C. House and N.C. Senate used to be this way, too. But now I believe it is just judicial seats. It appears that the Board of Elections identifies these seats by the last name of the incumbent. However, to make them useful for longitudinal analysis we should give each seat its own unique identifier that is independent of the name of the incumbent at any given point. For example, this particular seat -- perhaps "DISTRICT 25 SEAT A" in our nomenclature -- might be held by Cherry, Thornburg, Smith, Wang, and a dozen other names over time.)

rtburg commented 7 years ago

As I was cutting and pasting office names for this, I noticed that there is often a " " in the election results file from the BOE. I was hoping this could be used universally as a delimiter, but its position is inconsistent: sometimes it appears between jurisdiction and office, and sometimes it appears in other places.

rtburg commented 7 years ago

One suggestion here might be to use the Open Civic Data schema (what's the plural of schema? schemae? schemi?) for North Carolina and/or connect with the folks who are working on that project to see how we can contribute. (For example, I don't think they have judicial districts.)

An example of its implementation in a similar setting is in the Open Elections project: http://docs.openelections.net/common-fields/

rtburg commented 7 years ago

I think our 80-20 rule here may tell us to simply drop judicial contests for now. Even if we manually enter each judicial office for a particular election, we would then have to make a manual connection between a contest for a unique office -- NC DISTRICT COURT JUDGE DISTRICT 25 (CHERRY) -- with any contests for the same unique office in prior elections. In this example, there may be multiple contests for NC DISTRICT COURT JUDGE DISTRICT 25 in a single election, so we need to include the name of the incumbent to uniquely identify that contest. However, this same office two years earlier will have gone by a different name if there were a different incumbent in the seat.

I guess we could also just intake judicial contests as they are, using DISTRICT [#] + ([incumbent_name]) as the unique identifier in an election, and just not worry about longitudinal analysis. That probably makes the most sense, because the ability to do longitudinal analysis is limited even for legislative and Congressional districts. ("House District 27" after a redistricting, for example, may not have any overlapping geography with "House District 27" as it existed prior to the redistricting.)
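A rough sketch of that per-election identifier, assuming the contest name ends in `DISTRICT <number> (<incumbent>)` as in the Cherry example. The function `judicial_seat_key` and the tuple layout are hypothetical; the key is only meant to be unique within one election, not across elections.

```python
import re

# Assumption: the seat label is the parenthesized text at the very end of the name.
SEAT = re.compile(r"\bDISTRICT\s+(\d+[A-Z]?)\s*\(([^)]+)\)\s*$")

def judicial_seat_key(election_date, contest_name):
    """Return (election_date, district, incumbent) or None if the name does not fit."""
    m = SEAT.search(contest_name)
    if m is None:
        return None
    return (election_date, m.group(1), m.group(2).strip().upper())
```

Because the incumbent name is part of the key, the same seat gets a different key once the incumbent changes, which is exactly the longitudinal limitation described above.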

bill10 commented 7 years ago

Agree. I think one purpose of the database is to make queries easier. Ideally, one could just type "show me all results for NC DISTRICT COURT JUDGE" and get all relevant contests across districts, seats and years. Then one could filter by districts or years, etc.

So maybe I should clean it as much as possible without throwing away any information. As @rtburg said, from NC DISTRICT COURT JUDGE DISTRICT 25 (CHERRY) I can parse out NC DISTRICT COURT JUDGE and DISTRICT 25 (CHERRY), and DISTRICT 25 (CHERRY) will go in another column. Sound good?

That is actually what the ingestor is doing now (commit 0f415e9e223923e4cf5cd3b0053edf9d115d11ab). I split on DISTRICT if it is followed by a number. I will add the Roman-numeral and single-character district cases.
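To illustrate why the split helps with queries, here is a small sketch using an in-memory SQLite table. The table name `contests`, its two columns, and the helper names are assumptions for illustration only, not the project's actual schema; filtering by year would need an additional date column.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory table holding already-split contest names."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE contests (contest_name TEXT, district TEXT)")
    conn.executemany(
        "INSERT INTO contests VALUES (?, ?)",
        [
            ("NC DISTRICT COURT JUDGE", "DISTRICT 25 (CHERRY)"),
            ("NC DISTRICT COURT JUDGE", "DISTRICT 15A - Lambeth Seat"),
            ("TOWN OF NAVASSA COMMISSIONER", "DISTRICT 1"),
        ],
    )
    return conn

def contests_for(conn, contest_name):
    """Return every (contest_name, district) row for one office, across districts."""
    cur = conn.execute(
        "SELECT contest_name, district FROM contests "
        "WHERE contest_name = ? ORDER BY district",
        (contest_name,),
    )
    return cur.fetchall()
```

With the district in its own column, `contests_for(conn, "NC DISTRICT COURT JUDGE")` returns all seats for that office with an exact match, with no pattern matching on the full contest string.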

Case A) above is almost impossible to detect, so maybe we should leave it as-is, according to our 80-20 rule.

A) TOWN OF LEWISTON WOODVILLE TOWN COUNCIL LEWISTON DISTRICT (Where "DISTRICT" is a suffix. It follows a word that uniquely describes the office for which the candidate is running.)

bill10 commented 7 years ago

We can now deal with a district followed by a single letter or a Roman numeral between 1 and 9 (commit 85d48ad33186e17335fd01b511ddf24ee040365e).
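For reference, one way such matching could look. This is a sketch under assumptions, not the code in that commit: the pattern, `extract_district_token`, and `district_kind` are hypothetical names. Note the inherent ambiguity that "I" and "V" are both single letters and Roman numerals; this sketch treats them as numerals.

```python
import re

ROMAN_1_TO_9 = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5,
                "VI": 6, "VII": 7, "VIII": 8, "IX": 9}

# DISTRICT followed by a number (optionally with a letter suffix),
# a Roman numeral I-IX, or a single capital letter.
TOKEN = re.compile(r"\bDISTRICT\s+(\d+[A-Z]?|IX|IV|VI{0,3}|I{1,3}|[A-Z])\b")

def extract_district_token(contest_name):
    """Return (office, token); token is None when no district token follows DISTRICT."""
    matches = list(TOKEN.finditer(contest_name))
    if not matches:
        return contest_name.strip(), None
    last = matches[-1]
    return contest_name[:last.start()].strip(), last.group(1)

def district_kind(token):
    """Classify a token as 'number', 'roman', or 'letter'."""
    if token[0].isdigit():
        return "number"
    if token in ROMAN_1_TO_9:
        return "roman"
    return "letter"
```

Names where "DISTRICT" is a suffix or part of the office name (cases A and C from the 2011 results) fall through unchanged, because the word after "DISTRICT" is either missing or longer than one letter.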

rtburg commented 7 years ago

This issue was moved to NCVotes/results-ingestor#9

NCVotes / ncvoter

extract district from contest name #6