Open TrentonBush opened 1 month ago
```python
duplicate_locations = geocoded_locations[
    geocoded_locations[["county_id_fips", "project_id"]].duplicated(keep=False)
]
duplicate_locations.groupby(["project_id", "county_id_fips"])[
    "geocoded_locality_type"
].agg(lambda x: set(x)).value_counts()
```
outputs:

```
{county, city}    17
{county}          10
{city}             5
```
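For reference, the detection pattern above can be reproduced on a toy frame. This is a minimal sketch: the column names follow the snippet, but the data is made up, and `frozenset` is substituted for `set` so the aggregated values stay hashable for `value_counts()`.

```python
import pandas as pd

# Toy stand-in for the real geocoded_locations table (made-up data;
# only the column names come from the snippet above).
geocoded_locations = pd.DataFrame({
    "project_id": [1, 1, 2],
    "county_id_fips": ["35005", "35005", "22077"],
    "geocoded_locality_type": ["city", "county", "county"],
})

# Rows whose (county_id_fips, project_id) pair appears more than once.
duplicate_locations = geocoded_locations[
    geocoded_locations[["county_id_fips", "project_id"]].duplicated(keep=False)
]

# Which locality types collide within each duplicated pair.
type_sets = duplicate_locations.groupby(["project_id", "county_id_fips"])[
    "geocoded_locality_type"
].agg(frozenset)
print(type_sets.value_counts())
# a {city, county} pair is the signature of a split "city, county" place name
```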
About 100 duplicate (`project_id`, `county_id_fips`) entries are produced in the `gridstatus_locations` table due to non-standard formatting of raw place names. These are not whole-row duplicates. They occur for a couple of reasons:

1. Redundant place names like `Roswell, Chaves County`. The processing code is designed to treat delimiters as separating two separate locations, so a "city, county" entry gets erroneously split in two. Usually both pieces get geocoded back to the same county FIPS code, but this is not always the case for degenerate place names (e.g. `'Pointe Coupee, Pointe Coupee Parish'`). All the current instances of this are in Louisiana for some reason.
2. Multi-state raw place names like `'NJ, NY'` combined with the raw state `'NY'`, so they all get mapped to New York County, NY.

The impact of this duplication is fairly minor. Thanks to capacity allocation, the total MW are unchanged. But the duplicate `county_id_fips` values will double count the number of projects within a county in the wide-format data mart table. I think either the duplicates should be removed in downstream queries, or the agg func in
`dbcp/data_mart/counties.py:407` needs to be changed from `"project_id": "count"` to `"project_id": "nunique"`.
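A minimal sketch of the difference between the two agg funcs, using made-up data with one duplicated (`project_id`, `county_id_fips`) pair:

```python
import pandas as pd

# Made-up locations table with project 1 duplicated within county 35005,
# mirroring the duplication described above.
locations = pd.DataFrame({
    "county_id_fips": ["35005", "35005", "22077"],
    "project_id": [1, 1, 2],
})

# "count" counts duplicate rows twice; "nunique" counts each project once.
counted = locations.groupby("county_id_fips").agg({"project_id": "count"})
deduped = locations.groupby("county_id_fips").agg({"project_id": "nunique"})

print(counted.loc["35005", "project_id"])  # 2: project double counted
print(deduped.loc["35005", "project_id"])  # 1: correct project count
```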