WSWCWaterDataExchange / MappingStatesDataToWaDE2.0

Manage all code to map and import state's data into WaDE 2.0

BSD 3-Clause "New" or "Revised" License

4 stars 1 forks source link

To help prevent duplicate records from entering WaDE, the current approach (05/04/2023) is to do the following...

ensure no information is being auto filled by mistake (e.g., WaDE Unspecified)
ensure missing information between POD vs POU is being filled by the other (e.g. WaterSourceTypeCV)
perform groupby() function to build connections between unique IDs (e.g., WaterSourceNativeID, SiteNativeID, AllocationNativeID) even if we have to create our own. This builds the connections we need between water records, PODs, POUs, and combines incomplete records with more complete counterpart records.

However, this approach causes issues when there are native duplicate records that rely on the same nativeID value but are legitimate separate records. See below for examples...

MTwr: http://wrqs.dnrc.mt.gov/ResultsWS.aspx?search=simple&index=8&wrnumber=40J%2098694%2000&status=ACTV!SEVR

Notes on groupby() that will not work...

setting blank records to "" and performing a groupby() on the nativeID (e.g., groupby("AllocationNativeID")) will successfully combine incomplete records wither their more complete counterparts and allows us to create connections between records and sites (which is good). BUT, trying to expand upon the groupby() selection causes the records to be too unique to not combine correctly (e.g., groupby("AllocationNativeID", "AllocationFlow_CFS") will count "" flow records as unique and not properly join, which is bad).
can't replace "" values with np.nan, this actually just drops the entire row during a groupby().

Example code to view the above issue

example dataframe

AllocationNativeID	AllocationFlow_CFS	AllocationVolume_AF	BeneficialUseCategory
A1	1.1	5.5	Irrigation
A1		6.6	LiveStock
C1	3.3	7.7	Other
D1	4.4	8.8	Domestic

NativeID only (current approach)

Current approach. This successfully groups A1 into similar records that differ by BeneficialUseCategory and helps us catch non-similar values in AllocationVolume_AF

groupbyList = ['AllocationNativeID']
tempdfA = tempdf.groupby(groupbyList).agg(lambda x: ','.join([str(elem) for elem in (list(set(x))) if elem != ""])).replace(np.nan, "").reset_index()

AllocationNativeID	AllocationFlow_CFS	AllocationVolume_AF	BeneficialUseCategory
A1	1.1	5.5, 6.6	Irrigation, LiveStock
C1	3.3	7.7	Other
D1	4.4	8.8	Domestic

NativeID and Flow

This fails to group A1 into the group record we would like to see and creates a duplicate record in WaDE.

groupbyList = ['AllocationNativeID', 'AllocationFlow_CFS']
tempdfB = tempdf.groupby(groupbyList).agg(lambda x: ','.join([str(elem) for elem in (list(set(x))) if elem != ""])).replace(np.nan, "").reset_index()
tempdfB

AllocationNativeID	AllocationFlow_CFS	AllocationVolume_AF	BeneficialUseCategory
A1	1.1	5.5	Irrigation
A1		6.6	LiveStock
C1	3.3	7.7	Other
D1	4.4	8.8	Domestic

NativeID and Flow, convert "" to np.nan

This drops the second row entirely, which contained BeneficialUseCategory info we wanted to preserve for A1 records.

tempdfC = tempdf.replace("", np.nan) 
groupbyList = ['AllocationNativeID', 'AllocationFlow_CFS']
tempdfC = tempdfC.groupby(groupbyList).agg(lambda x: ','.join([str(elem) for elem in (list(set(x))) if elem != ""])).replace(np.nan, "").reset_index()

AllocationNativeID	AllocationFlow_CFS	AllocationVolume_AF	BeneficialUseCategory
A1	1.1	5.5	Irrigation
C1	3.3	7.7	Other
D1	4.4	8.8	Domestic

WSWCWaterDataExchange / MappingStatesDataToWaDE2.0

Pre-Processing issue: preventing accidental duplicate records prevents records that share nativeID #162

example dataframe

NativeID only (current approach)

NativeID and Flow

NativeID and Flow, convert "" to np.nan