Open e-belfer opened 1 week ago
The only fuzzy matching `addfips` does is to replace 12 diacritics (like "é") and standardize 3 abbreviations like "Saint"/"St.". The code is like 10 lines. Your input data has to have perfect quality modulo those two replacements or the matches will fail.
The cardinal sin of `addfips` is to not maintain the source census data or handle the slight changes over time, which is the only thing that needs maintenance! But I'll stop there because I think I've written more lines complaining about `addfips` than there are lines of code in the entire package 😂
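
For a sense of scale, the kind of normalization described in this comment could be sketched roughly as below. This is not the `addfips` source: the function name `normalize_county_name` and the `ABBREVIATIONS` table are hypothetical, and the sketch strips all combining marks via `unicodedata` rather than replacing a fixed list of 12 diacritics.

```python
import re
import unicodedata

# Hypothetical abbreviation table; the list used by the real package may differ.
ABBREVIATIONS = {
    r"\bst\.?(?=\s)": "saint",
    r"\bste\.?(?=\s)": "sainte",
    r"\bft\.?(?=\s)": "fort",
}


def normalize_county_name(name: str) -> str:
    """Lowercase a name, strip diacritics, and expand a few abbreviations."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_name = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = ascii_name.lower().strip()
    # Expand abbreviations only when they stand alone before another word.
    for pattern, replacement in ABBREVIATIONS.items():
        cleaned = re.sub(pattern, replacement, cleaned)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", cleaned)


print(normalize_county_name("St. Louis"))
print(normalize_county_name("Doña Ana"))
```

Anything beyond these two replacements (typos, renamed counties, missing suffixes) would still fail to match, which is the limitation the comment is pointing at.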
Is your feature request related to a problem? Please describe.
In #3531 we've learned that the `addfips` package we rely on to map state/county data into a FIPS code is irregularly maintained. This is causing problems for downstream users, and we should consider deprecating this package and moving to a different solution.
Describe the solution you'd like
There are a few different use cases for FIPS encoding that we could hypothetically imagine for PUDL. Right now, `add_fips_ids` in `pudl.src.helpers.py` is the only place we import the `addfips` package. This method expects a county and state name, and a FIPS vintage.
Approaches to resolving the problem
County and state names -> FIPS Code
This is our current FIPS encoding use case. We could:

- Fork `addfips`, and implement the fixes we want to see that solve the problem identified in #3531. We'd benefit relatively easily from any upstream changes, and if the project returns to being more actively maintained it wouldn't be particularly complex to switch back.
- Recreate the `addfips` methodology. As is noted in #338, the fuzzy matching implemented in `addfips` is definitely of relevance to us, so we shouldn't think we can just pull down a CSV and merge county and state names without losing records.
- The `pygris` package seems like a great API-free option for implementing this functionality down the line, though!

Address -> Lat lon -> FIPS Code or lat/lon -> FIPS code
As mentioned above, this wouldn't address the use case for 2/3 EIA datasets we're currently using the `addfips` package for. See the `pygris` package, which seems interesting if we want to explore this further, but this is out of scope for the actual problem we're trying to resolve. See also #3531 for some discussion on other ways to implement this.
Proposed solutions

- Fork `addfips` and implement our desired changes, then do our own release so we can import this as a package. Pros: We don't have to write much code, and we can keep up to date with any changes made in `addfips`. Cons: Extremely annoying and we're likely to forget about this. We're still relying on a largely unmaintained package.
- Recreate the logic of `addfips` in `pudl.helpers.py` using Census data directly (e.g., this 2023 data or by calling the Census's geometry API, see [here](https://www.census.gov/data/developers/data-sets/geo-info.html) for more info). Pros: We are our own maintainers. Cons: We have to recreate the fuzzy-matching logic in the package we're currently using to prevent performance loss.
- Find some other new Python package to meet our needs. `us` is maintained but doesn't yet support counties despite an open PR on this topic from 2017: https://github.com/unitedstates/python-us/. The `census` package depends on it, so there's no county support there either.
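
To make the "recreate the logic" option more concrete, here is a minimal sketch of a name-based merge against a reference table, assuming pandas. Everything in it is hypothetical: the function name `add_county_fips`, the reference columns `state`, `county` and `county_id_fips`, and the tiny hand-typed reference frame, which in practice would be built from the Census data. It only lowercases and strips whitespace, so it does not reproduce the `addfips` fuzzy matching. What it does show is how a `fips_matched` flag makes lost records visible instead of silently dropping them.

```python
import pandas as pd


def add_county_fips(df, reference, state_col="state", county_col="county"):
    """Left-merge FIPS codes onto df and flag rows that found no match."""

    def key(series):
        # Minimal cleanup only; real data would need fuller normalization.
        return series.str.strip().str.lower()

    left = df.assign(_state_key=key(df[state_col]), _county_key=key(df[county_col]))
    right = reference.assign(
        _state_key=key(reference["state"]),
        _county_key=key(reference["county"]),
    )[["_state_key", "_county_key", "county_id_fips"]]
    merged = left.merge(
        right,
        on=["_state_key", "_county_key"],
        how="left",
        validate="m:1",  # each state/county pair must be unique in the reference
        indicator=True,
    )
    merged["fips_matched"] = merged["_merge"] == "both"
    return merged.drop(columns=["_state_key", "_county_key", "_merge"])


# Hypothetical reference table; FIPS codes are strings to keep leading zeros.
reference = pd.DataFrame(
    {
        "state": ["Colorado", "Alabama"],
        "county": ["Boulder County", "Autauga County"],
        "county_id_fips": ["08013", "01001"],
    }
)
records = pd.DataFrame(
    {
        "state": ["colorado ", "Alabama", "Alabama"],
        "county": ["Boulder County", "Autauga County", "Autuaga County"],  # last one misspelled
    }
)
result = add_county_fips(records, reference)
print(result)
```

Counting the rows where `fips_matched` is false before and after any change to the matching logic would give a simple way to check that a replacement loses no more records than the current approach.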