deployment-gap-model-education-fund / deployment-gap-model

ETL code for the Deployment Gap Model Education Fund
https://www.deploymentgap.fund/
MIT License
6 stars 2 forks source link

Resolve duplicate queue projects due to delimiters in raw county names #335

Open TrentonBush opened 1 month ago

TrentonBush commented 1 month ago

About 100 duplicate (project_id, county_id_fips) entries are produced in the gridstatus_locations table due to non-standard formatting of raw place names. These are not whole row duplicates. They occur for a couple of reasons:

The impact of this duplication is fairly minor. Thanks to capacity allocation, the total MW are unchanged. But the duplicate county_id_fips will double count the number of projects within a county in the wide format data mart table. I think either the duplicates should be removed in downstream queries or the agg func in dbcp/data_mart/counties.py:407 needs to be changed from "project_id": "count" to "project_id": "nunique"

TrentonBush commented 1 week ago
duplicate_locations = geocoded_locations[
        geocoded_locations[["county_id_fips", "project_id"]].duplicated(keep=False)
    ]
duplicate_locations.groupby(['project_id', 'county_id_fips'])['geocoded_locality_type'].agg(lambda x: set(x)).value_counts()

outputs:

{county, city}    17
{county}          10
{city}             5