catalyst-cooperative / rmi-energy-communities

Partnership between Catalyst and RMI to identify energy communities as defined by the Inflation Reduction Act
MIT License

Impute missing lat, lon coordinates #113

Open katie-lamb opened 1 year ago

katie-lamb commented 1 year ago

There are records in the MSHA mines and EIA plants data that are missing latitude and longitude coordinates. Currently, these records are being excluded. Instead, try to impute the Census tract and county from the other locational data given for these records.

zaneselvans commented 1 year ago

IMHO we're dropping lat/lon values too aggressively in the main PUDL ETL right now, and we should make this fix as far upstream as possible. IIRC, we're currently treating the floating point values basically as strings and declaring them inconsistent if the digits aren't identical, which is not good.
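To illustrate the difference between string-style and numeric comparison, here is a minimal sketch, not the actual PUDL harvesting code. The column names (`plant_id`, `latitude`), the helper `consistent_coord`, and the tolerance `atol` are all hypothetical and chosen only for the example.

```python
import numpy as np
import pandas as pd


def consistent_coord(values: pd.Series, atol: float = 1e-3) -> float:
    """Return the median if all non-null values agree within atol degrees, else NaN.

    Hypothetical helper: atol=1e-3 degrees is an assumed tolerance, not a PUDL setting.
    """
    vals = values.dropna().to_numpy(dtype=float)
    if vals.size == 0:
        return np.nan
    # Compare numerically: the spread of reported values must be within tolerance.
    if vals.max() - vals.min() <= atol:
        return float(np.median(vals))
    return np.nan


# Made-up reports: plant 1 differs only in trailing digits, plant 2 truly disagrees.
reports = pd.DataFrame(
    {
        "plant_id": [1, 1, 1, 2, 2],
        "latitude": [39.7392, 39.73920001, 39.739, 40.0, 45.0],
    }
)

# Digit-for-digit comparison sees three distinct values for plant 1.
distinct_as_strings = reports.groupby("plant_id")["latitude"].agg(
    lambda s: s.astype(str).nunique()
)

# Tolerance-based comparison keeps plant 1 and rejects only plant 2.
resolved = reports.groupby("plant_id")["latitude"].agg(consistent_coord)
```

With a comparison like this, plant 1 keeps a latitude while plant 2 is still flagged as inconsistent. The right tolerance would need to be chosen against the real data.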

Ideally we would use the haversine distance between all the different (lon, lat) points to estimate an "actual" location, identify any totally crazy outliers, and assign the actual location to them. Rectilinear coordinates are probably good enough, though, and much simpler: all the points should be very close to each other, and if they're not, it'll be obvious in either spherical or Euclidean coordinates.
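A rough sketch of the haversine idea, under stated assumptions: the component-wise median stands in for the "actual" location, and `max_km` is an arbitrary cutoff for what counts as an outlier. The function names and the sample coordinates are made up for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def estimate_location(lons, lats, max_km=1.0):
    """Return ((lon, lat), outlier_mask) for one entity's reported points.

    Assumption: points farther than max_km from the component-wise median
    are treated as outliers and excluded from the final estimate.
    """
    lons = np.asarray(lons, dtype=float)
    lats = np.asarray(lats, dtype=float)
    dist = haversine_km(lons, lats, np.median(lons), np.median(lats))
    inlier = dist <= max_km
    if not inlier.any():
        # No cluster of agreeing points: give up rather than guess.
        return (np.nan, np.nan), ~inlier
    estimate = (float(np.median(lons[inlier])), float(np.median(lats[inlier])))
    return estimate, ~inlier


# Made-up reports for a single plant: three nearby points and one wild value.
lons = [-104.99, -104.9901, -104.9899, -95.0]
lats = [39.74, 39.7401, 39.7399, 30.0]
(est_lon, est_lat), outliers = estimate_location(lons, lats)

# Overwrite the outliers with the estimated location.
fixed_lons = np.where(outliers, est_lon, lons)
fixed_lats = np.where(outliers, est_lat, lats)
```

Swapping `haversine_km` for a plain Euclidean distance in degrees would flag the same outlier here, which is the point made above about rectilinear coordinates being good enough when the points are close together.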

We could also convert (lon, lat) into a geopoint / tuple stored in a single column.
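One possible shape for the single-column idea, as a sketch only: the column names `longitude`, `latitude`, and `lonlat` are assumptions, and a record missing either coordinate gets `None` rather than a partial tuple.

```python
import pandas as pd

# Made-up records: the second one is missing its longitude.
df = pd.DataFrame({"longitude": [-104.99, None], "latitude": [39.74, 35.0]})

# Only build a tuple when both coordinates are present.
has_both = df["longitude"].notna() & df["latitude"].notna()
df["lonlat"] = [
    (lon, lat) if ok else None
    for lon, lat, ok in zip(df["longitude"], df["latitude"], has_both)
]
```

Keeping the pair together means a record can't end up with a longitude from one report and a latitude from another.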