USEPA / EPATADA

This R package can be used to compile and evaluate Water Quality Portal (WQP) data for samples collected from surface water monitoring sites on streams and lakes. It can be used to create applications that support water quality programs and help states, tribes, and other stakeholders efficiently analyze the data.
https://usepa.github.io/EPATADA/
Creative Commons Zero v1.0 Universal

TADAOutliers #47

Open cristinamullin opened 2 years ago

cristinamullin commented 2 years ago

Consider adding outlier information to the TADA stats function.

Append one or two additional columns to the dataset that flag outliers at the individual station/characteristic level and/or across all stations for a characteristic.

Add a new input to the stats function that sets whether outliers are flagged within a single station (input ID) or across all stations:

Scale = AllStations

Scale = IndividualStations
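A minimal sketch of how the proposed Scale input might behave, written in base R. Everything here is an assumption for illustration: `flag_outliers()` and `is_iqr_outlier()` are hypothetical names, the columns `station`, `characteristic`, `unit`, and `value` are placeholders rather than actual TADA/WQP column names, and the 1.5 * IQR rule is borrowed from the later comments in this thread.

```r
# Hypothetical helper (not part of TADA): TRUE for values more than
# 1.5 * IQR below the 25th or above the 75th percentile.
is_iqr_outlier <- function(x) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  !is.na(x) & (x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}

# Hypothetical function; column names are placeholders, not TADA's.
flag_outliers <- function(data, scale = c("AllStations", "IndividualStations")) {
  scale <- match.arg(scale)
  group_cols <- c("characteristic", "unit")
  if (scale == "IndividualStations") {
    group_cols <- c("station", group_cols)
  }
  # ave() applies the helper within each group and keeps the original row order.
  flags <- stats::ave(data$value, data[group_cols], FUN = is_iqr_outlier)
  data$outlier_flag <- as.logical(flags)
  data
}

# Made-up example data: one high value at station A that is ordinary at station B.
wq <- data.frame(
  station = rep(c("A", "B"), each = 5),
  characteristic = "Total phosphorus",
  unit = "mg/L",
  value = c(1, 2, 3, 4, 100, 98, 99, 100, 101, 102)
)

all_stations <- flag_outliers(wq, scale = "AllStations")
by_station <- flag_outliers(wq, scale = "IndividualStations")
```

In this made-up dataset the value 100 at station A is flagged only when Scale = IndividualStations, because station B routinely reports values near 100 and widens the pooled interquartile range. That difference is the reason to expose the scale as a user choice.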

cristinamullin commented 1 year ago

We need to be cautious about removing outliers from environmental datasets.

This would only provide an option to review and remove data that differ from approximately 99% of the data available for a given parameter and unit combination. The goal is only to catch invalid data; many outliers are still valid results.

The tool would provide an option to flag data that fall above or below these values:

Upper Outlier = 75th percentile + 1.5 * (75th percentile - 25th percentile)

Lower Outlier = 25th percentile - 1.5 * (75th percentile - 25th percentile)
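The two thresholds above can be sketched in base R as follows. `iqr_fences()` is a hypothetical name, and the example values are made up. Note that `stats::quantile()` defaults to its type 7 definition; other quantile definitions would give slightly different thresholds.

```r
# Hypothetical helper: lower and upper outlier thresholds from the
# 25th and 75th percentiles, with k = 1.5 as in the formulas above.
iqr_fences <- function(x, k = 1.5) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

x <- c(2, 4, 4, 5, 6, 7, 8, 30)  # made-up results for one parameter/unit
fences <- iqr_fences(x)

# Values outside the thresholds are candidates for review, not automatic removal.
flagged <- x[x < fences["lower"] | x > fences["upper"]]
```

For normally distributed data these thresholds enclose roughly 99.3% of values, which is consistent with the "approximately 99%" figure mentioned above; skewed data will have more values outside them.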

Jim Hagy (see TADA Working Group notes: https://usepa.sharepoint.com/:w:/r/sites/AutomatedDataAnalysisWorkingGroup/_layouts/15/Doc.aspx?sourcedoc=%7BC74D9A1C-DCEE-46B1-AC07-E05AD63E2714%7D&file=IssuePaper_RetrievalQAQC_Jan2021.docx&action=default&mobileredirect=true): It would be useful to be able to select whether this flagging process is applied to the original data or to the log of the data. For data that are strongly log-normally distributed, many valid observations will be more than 1.5*IQR above the 75th percentile. If the same percentile rule were applied to the logs, far fewer of those valid observations would be flagged.

This is one place where the distribution charts become helpful. We could apply the outlier test to the original data or to the log of the data, depending on the data distribution. See examples in the CDC app: https://ergapps.shinyapps.io/atsdrepc/
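To illustrate the point about log-transformed data, here is a rough base R sketch that applies the same 1.5 * IQR rule on either scale. `flag_iqr()` and its `log_scale` argument are hypothetical, and the values are made up to mimic a strongly right-skewed parameter. The sketch assumes strictly positive results and stops otherwise.

```r
# Hypothetical helper: flag values outside the 1.5 * IQR thresholds,
# computed either on the original values or on their natural logs.
flag_iqr <- function(x, log_scale = FALSE) {
  if (log_scale && any(x <= 0, na.rm = TRUE)) {
    stop("log_scale = TRUE requires strictly positive values")
  }
  y <- if (log_scale) log(x) else x
  q <- stats::quantile(y, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  !is.na(y) & (y < q[1] - 1.5 * iqr | y > q[2] + 1.5 * iqr)
}

# Made-up values that are evenly spaced on a log scale.
x <- c(1, 2, 4, 8, 16, 32, 64, 128, 256)

raw_flags <- flag_iqr(x)                    # test on the original values
log_flags <- flag_iqr(x, log_scale = TRUE)  # test on the log of the values
```

On the original scale the largest value (256) falls above the upper threshold, while on the log scale nothing is flagged. Which scale is appropriate would depend on the distribution of each characteristic, which is where the distribution charts come in.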

cristinamullin commented 1 year ago

This topic could potentially be related to the censored data method used for each characteristic (but feel free to move this to a new issue):

Example.....

Cristina: Is 1/x useful?

Lesley Merrick (OR): They use it when the detection limit (or ½ detection limit) is above the water quality standard, particularly when using geomean. This is our white paper on using censored data in the IR: https://www.oregon.gov/deq/FilterDocs/iriCensoredData.pdf
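For context on why the censored data method matters here, the following base R sketch shows one common substitution approach (½ detection limit) feeding into a geometric mean. It is an illustration only and is not taken from the Oregon white paper; `substitute_nondetects()` and `geomean()` are hypothetical helpers and the values are made up.

```r
# Hypothetical helper: replace non-detects with a fraction of the detection limit.
substitute_nondetects <- function(value, detection_limit, censored, fraction = 0.5) {
  ifelse(censored, fraction * detection_limit, value)
}

# Geometric mean of strictly positive values.
geomean <- function(x) exp(mean(log(x)))

value <- c(NA, 0.8, 1.2, NA)  # NA marks a non-detect in this made-up example
detection_limit <- rep(0.5, 4)
censored <- is.na(value)

filled <- substitute_nondetects(value, detection_limit, censored)
gm <- geomean(filled)
```

Because every non-detect receives the same substituted value, the choice of substitution method shifts the lower percentiles, and therefore any percentile-based outlier thresholds computed for that characteristic.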

cristinamullin commented 5 months ago

This issue is related to the TADA Shiny issue and pending development of an outlier tab: https://github.com/USEPA/TADAShiny/issues/137

hillarymarler commented 1 week ago

A few existing packages related to outliers:

  1. envoutliers: Methods for Identification of Outliers in Environmental Data - https://cran.r-project.org/web/packages/envoutliers/index.html
  2. EnvStats: Package for Environmental Statistics, Including US EPA Guidance - https://cran.r-project.org/web/packages/EnvStats/index.html (some outlier functions)
  3. outliers: A collection of some tests commonly used for identifying outliers - https://cran.r-project.org/web/packages/outliers/index.html

@cristinamullin are there any notes from previous working group discussions that might be helpful for me to review on this topic?

@wokenny13 the EnvStats package might be useful to check out for some of the mod 3 functions.