WSWCWaterDataExchange / MappingStatesDataToWaDE2.0

Manage all code to map and import state's data into WaDE 2.0
BSD 3-Clause "New" or "Revised" License
4 stars 1 forks source link

Pre-Processing issue: preventing accidental duplicate records prevents records that share nativeID #162

Open rwjam opened 1 year ago

rwjam commented 1 year ago

To help prevent duplicate records from entering WaDE, the current approach (05/04/2023) is to do the following...

However, this approach causes issues when there are native duplicate records that rely on the same nativeID value but are legitimate separate records. See below for examples...

Notes on groupby() that will not work...

rwjam commented 1 year ago

Example code to view the above issue

example dataframe

AllocationNativeID AllocationFlow_CFS AllocationVolume_AF BeneficialUseCategory
A1 1.1 5.5 Irrigation
A1   6.6 LiveStock
C1 3.3 7.7 Other
D1 4.4 8.8 Domestic

NativeID only (current approach)

Current approach. This successfully groups A1 into similar records that differ by BeneficialUseCategory and helps us catch non-similar values in AllocationVolume_AF

groupbyList = ['AllocationNativeID']
tempdfA = tempdf.groupby(groupbyList).agg(lambda x: ','.join([str(elem) for elem in (list(set(x))) if elem != ""])).replace(np.nan, "").reset_index()
AllocationNativeID AllocationFlow_CFS AllocationVolume_AF BeneficialUseCategory
A1 1.1 5.5, 6.6 Irrigation, LiveStock
C1 3.3 7.7 Other
D1 4.4 8.8 Domestic

NativeID and Flow

This fails to group A1 into the group record we would like to see and creates a duplicate record in WaDE.

groupbyList = ['AllocationNativeID', 'AllocationFlow_CFS']
tempdfB = tempdf.groupby(groupbyList).agg(lambda x: ','.join([str(elem) for elem in (list(set(x))) if elem != ""])).replace(np.nan, "").reset_index()
tempdfB
AllocationNativeID AllocationFlow_CFS AllocationVolume_AF BeneficialUseCategory
A1 1.1 5.5 Irrigation
A1   6.6 LiveStock
C1 3.3 7.7 Other
D1 4.4 8.8 Domestic

NativeID and Flow, convert "" to np.nan

This drops the second row entirely, which contained BeneficialUseCategory info we wanted to preserve for A1 records.

tempdfC = tempdf.replace("", np.nan) 
groupbyList = ['AllocationNativeID', 'AllocationFlow_CFS']
tempdfC = tempdfC.groupby(groupbyList).agg(lambda x: ','.join([str(elem) for elem in (list(set(x))) if elem != ""])).replace(np.nan, "").reset_index()
AllocationNativeID AllocationFlow_CFS AllocationVolume_AF BeneficialUseCategory
A1 1.1 5.5 Irrigation
C1 3.3 7.7 Other
D1 4.4 8.8 Domestic