Duplicate records resulting from missing or incorrect sampling time

sheet_id 699 and 1149 both contain algae data from EFR 2005. In sheet_id 699 the time of collection is reported as 0. In sheet_id 1149 the time of collection is not recorded, so we assigned a value of 9999. Our QA/QC check did not flag these as duplicate observations because their collection times differ. As a result, the same observations appear twice in the dataframe, once under each sheet_id.

The same issue could arise whenever other identifiers, such as sample depth or station, are reported slightly differently among Excel files. I think we need to revise the QA/QC check to screen for observations with identical lake, date, cell_per_l, BV.um3.l, and taxa. Where these fields are identical, we then need to check the station, time, and depth fields to determine whether the observations are unique or duplicates.
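A rough sketch of what that two-step screen might look like, written in plain Python rather than against our actual dataframe. The field names come from the text above; the function names, the `example` rows, and every value in them (dates, stations, counts, taxa labels) are made up for illustration and are not from the real sheets.

```python
from collections import defaultdict

# Fields that must match exactly for two rows to be candidate duplicates
# (names taken from the issue text).
MATCH_FIELDS = ("lake", "date", "taxa", "cell_per_l", "BV.um3.l")
# Fields to inspect afterwards to decide whether the rows are truly distinct.
REVIEW_FIELDS = ("station", "time", "depth")


def find_candidate_duplicates(records):
    """Group rows that share all MATCH_FIELDS; return groups with 2+ rows."""
    groups = defaultdict(list)
    for row in records:
        key = tuple(row[f] for f in MATCH_FIELDS)
        groups[key].append(row)
    return [rows for rows in groups.values() if len(rows) > 1]


def differing_review_fields(rows):
    """List the REVIEW_FIELDS whose values are not identical across rows."""
    return [f for f in REVIEW_FIELDS if len({row.get(f) for row in rows}) > 1]


# Hypothetical rows for illustration only; values are NOT from the real data.
example = [
    {"sheet_id": 699, "lake": "EFR", "date": "2005-07-01", "taxa": "Taxon A",
     "cell_per_l": 1000.0, "BV.um3.l": 5000.0,
     "station": "S1", "time": 0, "depth": 1},
    {"sheet_id": 1149, "lake": "EFR", "date": "2005-07-01", "taxa": "Taxon A",
     "cell_per_l": 1000.0, "BV.um3.l": 5000.0,
     "station": "S1", "time": 9999, "depth": 1},
    {"sheet_id": 1149, "lake": "EFR", "date": "2005-07-01", "taxa": "Taxon B",
     "cell_per_l": 250.0, "BV.um3.l": 800.0,
     "station": "S1", "time": 9999, "depth": 1},
]

for rows in find_candidate_duplicates(example):
    ids = [r["sheet_id"] for r in rows]
    print(ids, "differ in:", differing_review_fields(rows))
```

With these made-up rows, the first two are grouped together and reported as differing only in `time`, which is the pattern described above for sheet_id 699 and 1149. The sketch only flags candidates for review; it does not decide whether a time of 0 or 9999 means "missing", and it relies on exact equality of cell_per_l and BV.um3.l, so values that differ by rounding between Excel files would slip through.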

USEPA / Phytoplankton-Data-Analysis, issue #32