USEPA / EPATADA

This R package can be used to compile and evaluate Water Quality Portal (WQP) data for samples collected from surface water monitoring sites on streams and lakes. It can be used to create applications that support water quality programs and help states, tribes, and other stakeholders efficiently analyze the data.
https://usepa.github.io/EPATADA/
Creative Commons Zero v1.0 Universal

Replicate QC flag #393

Open cristinamullin opened 7 months ago

cristinamullin commented 7 months ago

Is your feature request related to a problem? Please describe.

Users of TADA have noted that it would be useful to incorporate replicate field samples into water quality data analysis by flagging routine field sample measurements whose associated replicate field sample measurements fall outside a user-defined window of precision (relative percent difference or absolute difference). A two-stage data quality indicator, where low values should be within the absolute difference limit and high values within the relative percent difference (RPD) limit, may be appropriate. RPD is the difference between the routine sample result and its associated replicate sample result, typically expressed as a percentage of the mean of the two results. For example, some water quality analysts consider an RPD/CV above 20% to indicate a potentially concerning lack of precision, especially for non-particulate analytes. However, acceptable RPDs can vary widely depending on the characteristic being analyzed and the sampling method, so it is best for users to define their own level of RPD acceptability. In addition, a tiered approach may be more appropriate: the widely used 20% RPD limit applies to results above XX times the detection limit, while an absolute difference limit applies to results near or below the detection limit (e.g., phosphorus). An absolute difference approach is more appropriate than RPD for samples close to the detection limit, because even small absolute differences can show up as large relative percent differences that "fail" the 20% RPD test.

For example, when nutrient concentrations are close to the detection limit, it can become practically impossible to obtain a low RPD. In this scenario, high RPDs are acceptable because if you stand back and look at ALL the data, and not just the replicates, the data may agree perfectly well that nutrients are very low. DO NOT throw out data just because the RPD is >20%, unless you have good reason, or you will potentially bias your data toward high concentrations. QA procedures should not bias statistical analyses of the data. Note that a modest error in a measurement will have a much smaller effect than a QA process that builds in bias.
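The arithmetic behind this point can be illustrated with a small sketch. It assumes the common definition of RPD (absolute difference divided by the mean of the pair, times 100). The function name and the concentrations are made up for illustration, and the code is Python for demonstration only, not part of TADA.

```python
def relative_percent_difference(routine, replicate):
    """RPD in percent, assuming RPD = |a - b| / mean(a, b) * 100."""
    pair_mean = (routine + replicate) / 2
    if pair_mean == 0:
        # For non-negative results this means both are zero: no difference.
        return 0.0
    return abs(routine - replicate) / pair_mean * 100


# Hypothetical phosphorus pairs in mg/L (illustrative values, not real data).
low_pair = (0.004, 0.006)    # close to a detection limit
high_pair = (0.400, 0.402)   # well above it, same absolute difference

for routine, replicate in (low_pair, high_pair):
    diff = abs(routine - replicate)
    rpd = relative_percent_difference(routine, replicate)
    print(f"abs diff = {diff:.3f} mg/L, RPD = {rpd:.1f}%")
```

Both pairs differ by the same 0.002 mg/L, but the low pair has an RPD of about 40% while the high pair is around 0.5%, which is why a fixed 20% RPD test would reject the low pair even though both results agree that the concentration is very low.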

Describe the solution you'd like

Write a new function to flag paired replicates using a tiered approach: the widely used 20% RPD limit applies to results above XX times the detection limit, while an absolute difference limit applies to results near or below the detection limit (e.g., phosphorus). An absolute difference approach is more appropriate than RPD for samples close to the detection limit, because even small absolute differences can show up as large relative percent differences that "fail" the 20% RPD test.
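A minimal sketch of the tiered logic might look like the following. It is written in Python for illustration only and is not the TADA implementation. The function name, argument names, and flag strings are all hypothetical. It also assumes that the tier is chosen by comparing the mean of the pair to a multiple of the detection limit, which the request does not specify. The multiplier and the absolute difference limit are left as required arguments because the request leaves them to the user.

```python
def flag_replicate_pair(routine, replicate, detection_limit,
                        dl_multiplier, abs_diff_limit, rpd_limit=20.0):
    """Flag one routine/replicate pair using a two-tier precision check.

    Hypothetical sketch: results whose pair mean is above
    dl_multiplier * detection_limit are checked against rpd_limit (percent);
    all other results are checked against abs_diff_limit (result units).
    """
    diff = abs(routine - replicate)
    pair_mean = (routine + replicate) / 2

    if pair_mean > dl_multiplier * detection_limit:
        # High tier: relative percent difference (assumed |a - b| / mean * 100).
        rpd = diff / pair_mean * 100
        return "Within RPD limit" if rpd <= rpd_limit else "Exceeds RPD limit"

    # Low tier: near or below the detection limit, use the absolute difference.
    if diff <= abs_diff_limit:
        return "Within absolute difference limit"
    return "Exceeds absolute difference limit"


# Example call; the multiplier (5) and limits are arbitrary placeholder values.
settings = dict(detection_limit=0.003, dl_multiplier=5, abs_diff_limit=0.005)
print(flag_replicate_pair(0.004, 0.006, **settings))
print(flag_replicate_pair(0.40, 0.60, **settings))
```

Returning a descriptive flag rather than dropping rows keeps the decision about whether to exclude data with the user, in line with the caution above about not building bias into QA.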

Additional context

What are replicate samples and how are they used in water analyses?

Replicate field samples are samples taken to assess the reproducibility of the sampling technique or analytical method. They are independently carried through all the steps of the sampling and measurement process in an identical manner to their associated routine field sample and used to measure the precision of the total sampling method.

Theoretically, the analysis of a replicate field sample should yield a result very similar to that of its associated routine field sample. If the results are not the same or acceptably similar, it could signal possible contamination or other issues in the sampling chain. However, water quality can vary at very small scales, so a field replicate can confound analytical precision with small-scale variability. Field replicates tell you the potential for your method to yield the same results at a single time and place, to the extent that you are actually in exactly the same place, the few seconds (or any defined time window) from one sample to the next do not matter, and the water isn’t moving. Be careful about labeling data as imprecise or bad based on this alone.

See Issue Paper: https://usepa.sharepoint.com/:w:/r/sites/AutomatedDataAnalysisWorkingGroup/_layouts/15/Doc.aspx?sourcedoc=%7B12716121-CFA6-4845-88B0-F1C88070B29C%7D&file=IssuePaper_ReplicateSamples_July2023.docx&action=default&mobileredirect=true

And notes: https://usepa.sharepoint.com/:w:/r/sites/AutomatedDataAnalysisWorkingGroup/_layouts/15/Doc.aspx?sourcedoc=%7B4151F130-8C47-4A57-9D56-9A90D92FF74A%7D&file=TADAWorkingGroup_Jul2023.docx&action=default&mobileredirect=true

Reminders for TADA contributors addressing this issue

New features should include all of the following work:

hillarymarler commented 2 months ago

@wokenny13 - is this one you might be interested in working on?

wokenny13 commented 2 months ago

This could be an item I am interested in. I will review the current function and the requested enhancement idea in this request!

hillarymarler commented 1 month ago

This is a comment from @cefergus's review of https://github.com/USEPA/EPATADA/pull/501 (continuous data flagging updates)

I think the revised function looks good from what I could see. I ran the revised function on Fond du Lac data and a random TADA test data set. I looked for observations with the same location, depth, comparable data identifier, and organization, and it looked like the function is correctly flagging continuous vs. discrete observations. Some observations labeled "Field Msr/Obs" and flagged as "Continuous" look like duplicated result values. But maybe a different function can flag those instances.
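The kind of check described in this review could be sketched as a simple grouping step. This is an illustrative Python sketch, not code from TADA or from the pull request. The record field names (`location`, `depth`, `identifier`, `organization`, `value`) are hypothetical stand-ins for the corresponding columns.

```python
from collections import defaultdict


def find_duplicated_results(records):
    """Return groups of records that share all key fields and the same value.

    Each record is a dict; the field names used here are hypothetical.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["location"], rec["depth"], rec["identifier"],
               rec["organization"], rec["value"])
        groups[key].append(rec)
    # Keep only keys that occur more than once (possible duplicates).
    return {key: recs for key, recs in groups.items() if len(recs) > 1}


# Made-up example records for illustration only.
records = [
    {"location": "SITE-1", "depth": 1.0, "identifier": "TEMP", "organization": "ORG-A", "value": 12.3},
    {"location": "SITE-1", "depth": 1.0, "identifier": "TEMP", "organization": "ORG-A", "value": 12.3},
    {"location": "SITE-1", "depth": 1.0, "identifier": "TEMP", "organization": "ORG-A", "value": 12.9},
]
duplicates = find_duplicated_results(records)
print(len(duplicates))
```

Groups found this way are only candidates for review. Identical values at the same location and depth can be legitimate, so a sketch like this would surface them for inspection rather than remove them.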

hillarymarler commented 1 month ago

Once the replicate-flagging function is developed, maybe it can be run as part of TADA_FlagContinuousData so that replicate samples are not identified as continuous.