USEPA / Phytoplankton-Data-Analysis

Phytoplankton Data Analysis

Duplicates / Replicates #17

Closed mjpdenver closed 10 years ago

mjpdenver commented 10 years ago

In the file Drew data/h/PHYTO-02.xls there are apparent duplicate samples that will end up with the same ID. For example:

BHR 20001 112767 7/25/2002 0' Achnanthes catenata 5529 464471
BHR 20001 112774 7/25/2002 0' Achnanthes catenata 5836 490206

Currently, we don't have a qualifier field to handle this. Is this a problem? How will these be handled during analysis?

Other records from different files look like they could be the same record, differing only by rounding:

       ID                       lake station depth_ft date     taxa                   cell_per_l BV.um3.L class hab   sheet_id
69014  2BHR20001200606079999000 BHR  20001   0        20060607 Achnanthes minutissima 3228.235   461637.6       FALSE 1171
102552 2BHR20001200606079999000 BHR  20001   0        20060607 Achnanthes minutissima 3228.000   461638.0       FALSE 1143

These records come from the following files (sorry, these lines might be hard to see):

Drew data/h/Phyto 2006.xls and Drew data/l/Phytoplankton 2006.xls. Given the similar file names, one might guess the files are the same. However, they have vastly different numbers of rows, so comparisons would need to be done on a record-by-record basis.

In an algorithm, could we specify how close cell_per_l and BV must be between files before we conclude that two records are the same?

jbeaulie commented 10 years ago

First, thanks for catching this! I'm not sure I would have thought to include this QC check. I really appreciate your work on this.

I forwarded the first question to Lisa Underwood.

As for the second question, rounding errors shouldn't result in discrepancies greater than 1. Could we apply this criterion to the data and address the remaining duplicates on a case-by-case basis?
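A minimal sketch of that criterion, in Python rather than the project's R, and not taken from the repository. The function name same_record and the dictionary layout are assumptions made for illustration; the two records are the Achnanthes minutissima rows quoted earlier in the thread.

```python
def same_record(a, b, tol=1.0, fields=("cell_per_l", "BV.um3.L")):
    """Treat two records as the same when ID and taxa match and every
    measured field differs by no more than tol (hypothetical helper)."""
    if a["ID"] != b["ID"] or a["taxa"] != b["taxa"]:
        return False
    return all(abs(a[f] - b[f]) <= tol for f in fields)


# The two rows from the different 2006 files, as quoted in the thread.
rec_h = {"ID": "2BHR20001200606079999000", "taxa": "Achnanthes minutissima",
         "cell_per_l": 3228.235, "BV.um3.L": 461637.6}
rec_l = {"ID": "2BHR20001200606079999000", "taxa": "Achnanthes minutissima",
         "cell_per_l": 3228.000, "BV.um3.L": 461638.0}

print(same_record(rec_h, rec_l))  # differences are 0.235 and 0.4, both within 1
```

Pairs that fail this check but still share an ID would be the ones left for case-by-case review.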

jbeaulie commented 10 years ago

Drew data/h/PHYTO-02.xls duplicates

mjpdenver commented 10 years ago

Averaged values will have lower variance than individual values. There might not be any better solution than averaging, but we should keep track of the original readings.
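One way to average while keeping the original readings might look like the sketch below (Python, not the project's R code). The field names lake, station, sample_no, date, depth, taxa, cell_per_l and bv are assumed labels for the unlabelled PHYTO-02.xls columns, and average_duplicates is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def average_duplicates(records,
                       key_fields=("lake", "station", "date", "depth", "taxa"),
                       value_fields=("cell_per_l", "bv")):
    """Average readings that share the same key fields, keeping the
    original readings alongside each averaged value."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in key_fields)].append(rec)

    averaged = []
    for key, recs in groups.items():
        row = dict(zip(key_fields, key))
        for f in value_fields:
            row[f] = mean(r[f] for r in recs)
            row[f + "_orig"] = [r[f] for r in recs]  # original readings
        row["n_readings"] = len(recs)
        averaged.append(row)
    return averaged


# The two PHYTO-02.xls rows quoted at the top of the thread
# (column meanings are assumed, see above).
readings = [
    {"lake": "BHR", "station": "20001", "sample_no": 112767, "date": "7/25/2002",
     "depth": "0'", "taxa": "Achnanthes catenata", "cell_per_l": 5529, "bv": 464471},
    {"lake": "BHR", "station": "20001", "sample_no": 112774, "date": "7/25/2002",
     "depth": "0'", "taxa": "Achnanthes catenata", "cell_per_l": 5836, "bv": 490206},
]

for row in average_duplicates(readings):
    print(row)
```

Storing n_readings next to the mean lets later analysis distinguish averaged values from single readings.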

In one of the examples I showed (Achnanthes minutissima), the two records appear to be very nearly identical, suggesting that one is the rounded value of the other. The average would be an acceptable value to use, but averaging assumes there are two samples.

I think I will try an approach where values that are very similar are considered to be the same record, and possibly label duplicates with a qualifier.

Quoting the notification of Tue, 11 Mar 2014 09:38:36 -0700, Re: [Phytoplankton-Data-Analysis] Duplicates / Replicates (#17):

Drew data/h/PHYTO-02.xls duplicates

I checked with Lisa Underwood and these are field duplicates. Let's average them.

mjpdenver commented 10 years ago

This issue is addressed in the algaeCheck.R script. The qual_replicate field is coded with an R when it is inferred that two samples are replicates, that is, when they have identical ID information but "distinct" results.
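The logic of algaeCheck.R is not shown in this thread. The following is only a Python sketch of the rule as described, assuming that "distinct" means differing by more than the tolerance of 1 discussed above. The function flag_replicates and the ID "EXAMPLE-ID-1" are hypothetical.

```python
from collections import defaultdict


def flag_replicates(records, tol=1.0, value_fields=("cell_per_l", "BV.um3.L")):
    """Return copies of the records with qual_replicate set to "R" when
    another record has the same ID and taxa but results differing by more
    than tol in any value field; otherwise qual_replicate is ""."""
    out = [dict(rec, qual_replicate="") for rec in records]
    by_id = defaultdict(list)
    for rec in out:
        by_id[(rec["ID"], rec["taxa"])].append(rec)

    for recs in by_id.values():
        for i, a in enumerate(recs):
            for b in recs[i + 1:]:
                if any(abs(a[f] - b[f]) > tol for f in value_fields):
                    a["qual_replicate"] = b["qual_replicate"] = "R"
    return out


records = [
    # Near-identical pair from the thread: treated as the same record.
    {"ID": "2BHR20001200606079999000", "taxa": "Achnanthes minutissima",
     "cell_per_l": 3228.235, "BV.um3.L": 461637.6},
    {"ID": "2BHR20001200606079999000", "taxa": "Achnanthes minutissima",
     "cell_per_l": 3228.000, "BV.um3.L": 461638.0},
    # Field-duplicate values from PHYTO-02.xls under a made-up ID.
    {"ID": "EXAMPLE-ID-1", "taxa": "Achnanthes catenata",
     "cell_per_l": 5529.0, "BV.um3.L": 464471.0},
    {"ID": "EXAMPLE-ID-1", "taxa": "Achnanthes catenata",
     "cell_per_l": 5836.0, "BV.um3.L": 490206.0},
]

for rec in flag_replicates(records):
    print(rec["ID"], rec["cell_per_l"], repr(rec["qual_replicate"]))
```

The input list is left unmodified, so the original readings remain available next to the flagged copies.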