USEPA / Phytoplankton-Data-Analysis

Phytoplankton Data Analysis

Duplicate records resulting from missing or incorrect sampling time #32

Open jbeaulie opened 10 years ago

jbeaulie commented 10 years ago

sheet_id 699 and sheet_id 1149 both contain algae data from EFR 2005. In sheet_id 699 the time of collection is reported as 0. In sheet_id 1149 the time of collection is not recorded, so we assigned a value of 9999. Our QA/QC check did not flag these as duplicate observations because they have different collection times. As a result, we have the following duplicate observations in the dataframe:

[image: duplicate observations in the dataframe; not preserved]

It seems the same issue could arise when other identifiers, such as sample depth or station, are reported slightly differently among Excel files. I think we need to revise the QA/QC check to screen for observations with identical lake, date, taxa, cell_per_l, and BV.um3.l values. Where these fields are identical, we need to check the station, time, and depth fields to determine whether the observations are unique or duplicates.
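
The proposed screen could be sketched roughly as follows. This is an illustration in Python, not the project's R code: the column names are taken from the text above, the rows are made-up placeholder values rather than real EFR data, and `find_candidate_duplicates` is a hypothetical helper. It groups rows on the measurement fields only, then returns any group with more than one row so that station, time, and depth can be reviewed.

```python
from collections import defaultdict

# Fields that should match exactly for two rows to be suspected duplicates.
# Names follow the issue text; the real dataframe may use different ones.
MEASURE_KEYS = ("lake", "date", "taxa", "cell_per_l", "BV.um3.l")

def find_candidate_duplicates(rows):
    """Return groups of rows that agree on every field in MEASURE_KEYS."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[k] for k in MEASURE_KEYS)
        groups[key].append(row)
    # Only groups with two or more rows need a manual look at
    # station, time, and depth.
    return [group for group in groups.values() if len(group) > 1]

# Placeholder rows (invented values, for illustration only).
rows = [
    {"sheet_id": 699, "lake": "EFR", "date": "2005-06-01", "taxa": "Taxon A",
     "cell_per_l": 1000, "BV.um3.l": 5000, "station": "S1", "time": "0", "depth": 1},
    {"sheet_id": 1149, "lake": "EFR", "date": "2005-06-01", "taxa": "Taxon A",
     "cell_per_l": 1000, "BV.um3.l": 5000, "station": "S1", "time": "9999", "depth": 1},
    {"sheet_id": 699, "lake": "EFR", "date": "2005-06-01", "taxa": "Taxon B",
     "cell_per_l": 250, "BV.um3.l": 800, "station": "S1", "time": "0", "depth": 1},
]

candidates = find_candidate_duplicates(rows)
for group in candidates:
    print([(r["sheet_id"], r["station"], r["time"], r["depth"]) for r in group])
```

The match on cell_per_l and BV.um3.l is exact here. If the Excel files round these values differently, a tolerance would be needed, which this sketch does not attempt.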

willbarnett commented 9 years ago

This is fixed in algaeCheck.R, but only for the case where the sampling time is 9999 or 0000. If there are other cases where duplicates could exist, we need to amend the code.
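
For reference, one way to express the 9999/0000 case is to treat both codes as "time unknown" before comparing rows. The sketch below is in Python and is not a transcription of algaeCheck.R; `normalize_time` and `times_compatible` are hypothetical names. It also assumes that 0 and 0000 never mean a genuine midnight sample, which would need to be confirmed against the field sheets.

```python
# Codes assumed to mean "time not recorded" (assumption based on this thread).
MISSING_TIME_CODES = {"0", "0000", "9999"}

def normalize_time(value):
    """Map placeholder codes to None; pad real times to four digits (HHMM)."""
    if value is None:
        return None
    text = str(value).strip()
    if text == "" or text in MISSING_TIME_CODES:
        return None
    return text.zfill(4)

def times_compatible(a, b):
    """True if two sampling times cannot rule out a duplicate observation."""
    ta, tb = normalize_time(a), normalize_time(b)
    # An unknown time is compatible with anything; known times must match.
    return ta is None or tb is None or ta == tb

print(times_compatible(0, 9999))
print(times_compatible("0930", "1015"))
```

With this rule, two rows whose times are 0 and 9999 are no longer treated as distinct on time alone, while rows with two different recorded times still are.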