csg-org / esb-data-standard

EAVS Section B Data Standard
https://eavs-section-b-data-standard.readthedocs.io/en/latest/

Use Alpha-3 Code of ISO 3166 for Standardized VoterMailingCountry #18

Closed by colinmacfarlane 4 years ago

colinmacfarlane commented 5 years ago

In 2016, the strings for VoterMailingCountry varied in spelling, capitalization, and punctuation (e.g., "Bolivia" vs. "Bolivia, Plurinational State of", or "KENYA" vs. "Kenya"). VoterMailingCountry values should all be standardized to ISO 3166's alpha-3 code. The standardized three-letter codes will make it easy to write crosswalks if analysts want to switch between ISO 3166's alpha-3 code and other country formats. They also conserve space: 2016's data needed a 56-character string variable (str56), whereas alpha-3 codes would fit in str3. Data volume is already a concern and should be managed as this scales to more localities and multiple transactions.
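
To illustrate the kind of crosswalk I have in mind, here is a minimal Python sketch. The `NAME_TO_ALPHA3` table and the `to_alpha3` helper are hypothetical names made up for this example, and the table only covers the variants mentioned above; a real crosswalk would need every ISO 3166 entry plus whatever spellings appear in the submitted data.

```python
# Hypothetical crosswalk: a real one would cover every ISO 3166 entry
# plus whatever spelling variants show up in the submitted data.
NAME_TO_ALPHA3 = {
    "bolivia": "BOL",
    "bolivia, plurinational state of": "BOL",
    "kenya": "KEN",
}


def to_alpha3(raw_country):
    """Return the alpha-3 code for a free-text country name, or None if unknown."""
    # Collapse case and whitespace so "KENYA" and " Kenya " hit the same key.
    key = " ".join(raw_country.lower().split())
    return NAME_TO_ALPHA3.get(key)


for value in ["Bolivia", "Bolivia, Plurinational State of", "KENYA", "Kenya"]:
    print(value, "->", to_alpha3(value))
```

Returning `None` for unknown names (rather than guessing) keeps unmatched values visible so they can be reviewed by hand.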

jungshadow commented 5 years ago

While I'm generally fine with making this change (noting that we currently ask that the name follow the ISO 3166 standard), I'd like to run it by the jurisdictions. I have concerns around validation and data harvesting. For data harvesting, most of these datasets are being pulled directly from databases, and there is no easy way to use a library to process the data inside a query. The jurisdictions would have to either standardize their data or do post-processing after harvesting; both options have drawbacks. Regarding validation, the alpha-3 codes might be difficult for the states to validate against the original dataset.
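
On the validation point, one possible approach (a rough sketch, not something any jurisdiction currently does as far as I know) is to compare per-country tallies: map the original names through the same crosswalk and check that the counts match the counts of the harvested alpha-3 codes. The function name `tally_mismatches` and the `"UNMAPPED"` placeholder are assumptions for illustration only.

```python
from collections import Counter


def tally_mismatches(original_names, harvested_codes, name_to_alpha3):
    """Compare per-code counts between the source data and the harvested file.

    Returns {code: (expected_count, actual_count)} for every code whose
    counts differ; an empty dict means the tallies agree.
    """
    # Names missing from the crosswalk are counted under a placeholder
    # so they surface as a mismatch instead of disappearing.
    expected = Counter(name_to_alpha3.get(name, "UNMAPPED") for name in original_names)
    actual = Counter(harvested_codes)
    return {
        code: (expected[code], actual[code])
        for code in expected.keys() | actual.keys()
        if expected[code] != actual[code]
    }


crosswalk = {"Kenya": "KEN", "KENYA": "KEN", "Bolivia": "BOL"}
print(tally_mismatches(["Kenya", "KENYA", "Bolivia"], ["KEN", "KEN", "BOL"], crosswalk))
```

Matching tallies do not prove every row was converted correctly, but a mismatch is a cheap signal that something was dropped or mis-mapped.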

For one specific data harvesting example, as I understand it, California required all of its counties to separate and standardize the country names in their address data, but not according to ISO 3166, so there are some fairly significant differences. Because this happened under a policy decision, California counties wouldn't be able to re-standardize their country names, so they would have to either map their existing values against ISO 3166 or write a script to process the data that results from the database query, which involves additional time and effort.
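
To give a sense of what that post-processing step might look like, here is a minimal sketch that runs over rows already pulled from the database. Everything in it is an assumption for illustration: the `standardize_rows` function, the dict-per-row layout, and the sample local values, which are invented and are not California's actual standardized names.

```python
def standardize_rows(rows, local_to_alpha3, field="VoterMailingCountry"):
    """Replace locally standardized country names with alpha-3 codes.

    Returns (converted_rows, unmapped_values). Rows whose value is not in
    the mapping are passed through unchanged and reported separately.
    """
    converted, unmapped = [], set()
    for row in rows:
        value = row[field]
        code = local_to_alpha3.get(value)
        if code is None:
            unmapped.add(value)
            converted.append(dict(row))  # copy, leave the value as harvested
        else:
            converted.append({**row, field: code})
    return converted, unmapped


# Invented local values, used only to exercise the function.
local_to_alpha3 = {"Korea, South": "KOR", "Great Britain": "GBR"}
rows = [
    {"VoterMailingCountry": "Korea, South"},
    {"VoterMailingCountry": "Great Britain"},
    {"VoterMailingCountry": "Unknown Place"},
]
converted, unmapped = standardize_rows(rows, local_to_alpha3)
print(converted)
print(unmapped)
```

The one-time cost is building the local-to-alpha-3 mapping; after that, the unmapped set shows which values still need a decision.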