USEPA / EPATADA

This R package can be used to compile and evaluate Water Quality Portal (WQP) data for samples collected from surface water monitoring sites on streams and lakes. It can be used to create applications that support water quality programs and help states, tribes, and other stakeholders efficiently analyze the data.
https://usepa.github.io/EPATADA/
Creative Commons Zero v1.0 Universal

QC data handling #213

Closed ehinman closed 1 year ago

ehinman commented 1 year ago

From meeting with R8 (Tina Laidlaw, Troy Hill, Maggie Pierce) about QC data review and handling for tribes. These are future improvements we could add to the TADA package/Shiny.

- FLAGS
  - Create a flag function that is used to calculate % data types, which we will need for the censored data summary too (e.g., 10% censored, 10% QAQC, 75% routine, 5% unknown).
    - To create the flag, look at the following fields and create bin logic for each allowable value:
      - ResultValueType
      - ActivityType
      - ResultMeasureQualifier
      - ActivityGroupType
      - DetectionQuantitationLimitType
      - ResultDetectionCondition
      - ResultStatus
      - StatisticalBase
    - Maybe these as well:
      - AnalyticalMethod
      - AnalyticalMethodContext
      - SampleCollectionEquipment
      - SampleCollectionMethod
      - SampleCollectionMethodContext
      - ThermalPreservativeUsed
  - More flag function ideas
    - If a QC duplicate differs from the routine sample by some magnitude, flag it as potentially invalid?
    - If duplicates are present, add an option to average them.
  - Research question for TADA Working Group: How else is QAQC data prepared for use in assessments? What flags and transformations are needed to group it with routine samples/preprocess it for use? Do we have examples?
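As a rough illustration of the "% data types" idea, here is a minimal base R sketch that bins only ActivityTypeCode and reports the share of results in each bin. The function names `bin_activity_type` and `summarize_data_types` and the bin rules themselves are assumptions made for this example, not existing TADA functions or agreed logic; a real version would also need the other fields listed above (for example, to identify censored data).

```r
# Hypothetical helper: bin ActivityTypeCode values into broad data types.
# The bin rules are illustrative assumptions, not agreed TADA logic.
bin_activity_type <- function(activity_type) {
  bin <- rep("unknown", length(activity_type))
  bin[grepl("^Quality Control", activity_type)] <- "QAQC"
  bin[grepl("^(Sample-Routine|Field Msr/Obs)", activity_type)] <- "routine"
  bin  # NA and unmatched values stay "unknown"
}

# Percent of results in each bin, rounded to one decimal place.
summarize_data_types <- function(activity_type) {
  bins <- factor(bin_activity_type(activity_type),
                 levels = c("routine", "QAQC", "unknown"))
  round(100 * prop.table(table(bins)), 1)
}

example_types <- c("Sample-Routine", "Sample-Routine", "Field Msr/Obs",
                   "Quality Control Sample-Field Blank", NA)
pct <- summarize_data_types(example_types)
print(pct)
```

Using a factor with fixed levels keeps every bin in the summary even when it has no rows, which makes the output easier to compare across datasets.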

cristinamullin commented 1 year ago

@ehinman with your recent PR, do we now flag data types by values in the ActivityTypeCode as part of autoclean as well? We could flag and mark these in the overall keep/remove file as remove too... If we did that, it doesn't mean they can't be used to confirm other result values as quality checked or not, but means the QAQC data would not by itself be counted as a separate value in analyses.

Note: I believe there is code here that we can leverage for this issue: https://github.com/massbays-tech/MassWateR

cristinamullin commented 1 year ago

We can leverage the domain table for this and check if new values are added as a test (like you did for detection limit types): https://cdx.epa.gov/wqx/download/DomainValues/ActivityType.CSV
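A minimal sketch of that kind of check, assuming the flagging logic is built on a fixed list of handled values: compare the list against the current domain values and report anything new. The helper name `find_unhandled_values` and the value "Sample-Brand New Type" are made up for illustration.

```r
# Hypothetical helper: return domain values that the flagging logic
# does not yet know about.
find_unhandled_values <- function(domain_values, handled_values) {
  sort(setdiff(unique(domain_values), handled_values))
}

# Values the (hypothetical) flagging logic already handles.
handled <- c("Field Msr/Obs", "Sample-Routine",
             "Quality Control Sample-Field Blank")

# Stand-in for the downloaded domain table; the last value is made up.
current_domain <- c(handled, "Sample-Brand New Type")

new_values <- find_unhandled_values(current_domain, handled)
print(new_values)
```

In practice `current_domain` would come from reading ActivityType.CSV (for example with `read.csv()`); which column holds the allowable values is an assumption to confirm against the file. A package test could then fail whenever `new_values` is non-empty.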

Logic (flag results with bolded values):

Field Msr/Obs
Field Msr/Obs-Habitat Assessment
Field Msr/Obs-Incidental
Field Msr/Obs-Portable Data Logger
Quality Control Alternative Measurement Sensitivity
Quality Control Alternative Measurement Sensitivity Plus
Quality Control Field Calibration Check
Quality Control Field Msr/Obs Post-Calibration
Quality Control Field Msr/Obs Pre-Calibration
Quality Control Field Replicate Habitat Assessment
Quality Control Field Replicate Msr/Obs
Quality Control Field Replicate Portable Data Logger
Quality Control Field Replicate Sample-Composite
Quality Control Field Sample Equipment Rinsate Blank
Quality Control Lab Sample Equipment Rinsate Blank
Quality Control Sample-Blind Duplicate
Quality Control Sample-Equipment Blank
Quality Control Sample-Field Ambient Conditions Blank
Quality Control Sample-Field Blank
Quality Control Sample-Field Replicate
Quality Control Sample-Field Spike
Quality Control Sample-Field Surrogate Spike
Quality Control Sample-Inter-lab Split
Quality Control Sample-Lab Blank
Quality Control Sample-Lab Continuing Calibration Verification
Quality Control Sample-Lab Control Sample/Blank Spike
Quality Control Sample-Lab Control Sample/Blank Spike Duplicate
Quality Control Sample-Lab Control Standard
Quality Control Sample-Lab Control Standard Duplicate
Quality Control Sample-Lab Duplicate
Quality Control Sample-Lab Duplicate 2
Quality Control Sample-Lab Initial Calib Certified Reference Material
Quality Control Sample-Lab Initial Calibration Verification
Quality Control Sample-Lab Matrix Spike
Quality Control Sample-Lab Matrix Spike Duplicate
Quality Control Sample-Lab Re-Analysis
Quality Control Sample-Lab Spike
Quality Control Sample-Lab Spike Duplicate
Quality Control Sample-Lab Spike Target
Quality Control Sample-Lab Spike of a Lab Blank
Quality Control Sample-Lab Split
Quality Control Sample-Lab Surrogate Control Standard
Quality Control Sample-Lab Surrogate Control Standard Duplicate
Quality Control Sample-Lab Surrogate Method Blank
Quality Control Sample-Measurement Precision Sample
Quality Control Sample-Other
Quality Control Sample-Post-preservative Blank
Quality Control Sample-Pre-preservative Blank
Quality Control Sample-Reagent Blank
Quality Control Sample-Reference Sample
Quality Control Sample-Trip Blank
Quality Control-Calibration Check
Quality Control-Calibration Check Buffer
Quality Control-Meter Lab Blank
Quality Control-Meter Lab Duplicate
Quality Control-Meter Lab Duplicate 2
Quality Control-Negative Control
Sample-Composite With Parents
Sample-Composite Without Parents
Sample-Depletion Replicate
Sample-Field Split
Sample-Field Subsample
Sample-Integrated Cross-Sectional Profile
Sample-Integrated Flow Proportioned
Sample-Integrated Horizontal Profile
Sample-Integrated Horizontal and Vertical Composite Profile
Sample-Integrated Time Series
Sample-Integrated Vertical Profile
Sample-Negative Control
Sample-Other
Sample-Positive Control
Sample-Routine
Sample-Routine Resample

ehinman commented 1 year ago

@cristinamullin We do not do anything with ActivityTypeCodes as of yet, but certainly could/should with the convention you suggest. Happy to start a new branch and pursue this once the other PR passes and is merged into develop.

cristinamullin commented 1 year ago

@katiehealy tagging you in case you may be interested in working on the logic and code for a new package function to flag or transform result values based on this QC metadata information? For example, if a QC duplicate result value (e.g. "Sample-Routine Resample", "Quality Control Sample-Field Replicate") is some magnitude difference from the routine sample (e.g. "Field Msr/Obs") ... flag as potentially invalid? Or if duplicates are present and not flagged for that first reason, add an option to average the value with the routine sample value?
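One way the "magnitude difference" check could look, sketched in base R using relative percent difference (RPD), a common QC metric swapped in here for illustration. The function names, the 20% threshold, and the flag labels are all placeholders, and the sketch assumes each duplicate has already been matched to its routine result, which is an open question in itself.

```r
# Relative percent difference (RPD) between a routine result and its duplicate.
# Returns NaN when both values are zero.
relative_percent_difference <- function(routine, duplicate) {
  100 * abs(routine - duplicate) / ((routine + duplicate) / 2)
}

# Hypothetical helper: flag pairs whose RPD exceeds max_rpd and average the rest.
# The 20% default is a placeholder, not a recommended criterion.
compare_duplicates <- function(routine, duplicate, max_rpd = 20) {
  rpd <- relative_percent_difference(routine, duplicate)
  flagged <- rpd > max_rpd
  data.frame(
    routine = routine,
    duplicate = duplicate,
    rpd = round(rpd, 1),
    flag = ifelse(flagged, "Potentially invalid", "Pass"),
    averaged = ifelse(flagged, NA_real_, (routine + duplicate) / 2)
  )
}

# Two already-matched pairs: one close, one far apart.
pairs <- compare_duplicates(routine = c(10, 4), duplicate = c(11, 8))
print(pairs)
```

Averaging only the pairs that pass keeps a questionable duplicate from silently shifting the routine value; whether a flagged pair should be dropped, kept as-is, or reviewed manually is left open.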

cristinamullin commented 1 year ago

Currently, we only deal with this issue in the vignette:

This chunk of code removes every row whose value in the ActivityTypeCode field includes the string "Quality".

See WQX domain file to review all the ActivityTypeCode allowable values:

https://cdx.epa.gov/wqx/download/DomainValues/ActivityType.CSV

Access all WQX Domain Files

https://www.epa.gov/waterdata/storage-and-retrieval-and-water-quality-exchange-domain-services-and-downloads

# Drop every row whose ActivityTypeCode contains the string "Quality"
TADAProfileClean14 <- dplyr::filter(TADAProfileClean13, !grepl("Quality", ActivityTypeCode))
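Following the keep/remove idea raised earlier in this thread, a minimal base R sketch of flagging QC rows instead of dropping them might look like this. The column name `QCFlag`, its "Keep"/"Remove" values, and the toy data frame are assumptions for illustration only.

```r
# Toy stand-in for a WQP profile; only the column used here is included.
profile <- data.frame(
  ActivityTypeCode = c("Sample-Routine",
                       "Quality Control Sample-Field Replicate",
                       "Field Msr/Obs"),
  stringsAsFactors = FALSE
)

# Hypothetical column: mark QC rows rather than deleting them.
profile$QCFlag <- ifelse(grepl("Quality", profile$ActivityTypeCode),
                         "Remove", "Keep")

# Rows to use in analyses; QC rows remain in `profile` for later checks.
analysis_rows <- profile[profile$QCFlag == "Keep", , drop = FALSE]
print(profile)
```

Keeping the QC rows in the flagged table means they can still be used to confirm routine results without being counted as separate values in analyses.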

katiehealy commented 1 year ago

Sure, I could work on this! But I think I'll need a little more explanation about how to match the QC sample with the routine sample. Also, are we only interested in the differences between routine resamples and routine samples? And aren't there already quality flags from the labs for blanks/spikes that don't meet QC criteria?