Open cristinamullin opened 2 years ago
We need to be cautious about removal of outliers in environmental datasets.
The tool would only provide an option to review and remove results that fall outside the range covering approximately 99% of the data available for a given parameter and unit combination. The goal is only to catch invalid data; many outliers are still valid results.
The tool would provide an option to flag data that fall above or below these values:
- Upper Outlier = 75th percentile + 1.5 × (75th percentile - 25th percentile)
- Lower Outlier = 25th percentile - 1.5 × (75th percentile - 25th percentile)
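As a rough illustration of the two thresholds above, here is a minimal Python sketch (standard library only). The function names `iqr_fences` and `flag_outliers` are made up for this example and are not part of TADA. The "inclusive" quantile method is an assumption; other percentile definitions would give slightly different fences.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) thresholds: Q1 - k*IQR and Q3 + k*IQR."""
    # "inclusive" interpolates linearly between the sorted data points;
    # this choice of percentile definition is an assumption of the sketch.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return one boolean per value: True if it falls outside the fences."""
    lower, upper = iqr_fences(values, k)
    return [v < lower or v > upper for v in values]

# Made-up results for a single parameter/unit combination
results = [1, 2, 3, 4, 5, 6, 7, 8, 100]
lower, upper = iqr_fences(results)
flags = flag_outliers(results)
```

The flags only mark results for review; nothing is removed, which fits the caution above that many outliers are still valid.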
Jim Hagy (see TADA Working Group notes: https://usepa.sharepoint.com/:w:/r/sites/AutomatedDataAnalysisWorkingGroup/_layouts/15/Doc.aspx?sourcedoc=%7BC74D9A1C-DCEE-46B1-AC07-E05AD63E2714%7D&file=IssuePaper_RetrievalQAQC_Jan2021.docx&action=default&mobileredirect=true): It would be useful to be able to select whether this flagging process is applied to the original data or the log of the data. For data that are strongly log-normally distributed, many valid observations will be >1.5*IQR above the 75th percentile. But if you applied those percentiles to the logs, it would be a different story.
This is one place where the distribution charts become helpful. We could apply the outlier test to the original data or to the log of the data, depending on the data distribution. See examples in the CDC app: https://ergapps.shinyapps.io/atsdrepc/
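To make the original-versus-log choice concrete, here is a small Python sketch (standard library only). The function name and the `log_scale` argument are hypothetical, not an existing TADA input. It assumes all values are positive when the log option is used, and it uses the "inclusive" percentile definition as a further assumption.

```python
import math
import statistics

def flag_outliers(values, k=1.5, log_scale=False):
    """Flag values outside Q1 - k*IQR or Q3 + k*IQR.

    log_scale (hypothetical option): if True, the test is applied to the
    natural log of the values instead of the original values.
    """
    if log_scale:
        if any(v <= 0 for v in values):
            raise ValueError("log_scale=True requires strictly positive values")
        data = [math.log(v) for v in values]
    else:
        data = list(values)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x < lower or x > upper for x in data]

# Made-up, strongly right-skewed results (each value doubles the previous one)
skewed = [1, 2, 4, 8, 16, 32, 64, 128, 256]
raw_flags = flag_outliers(skewed)                  # flags the largest value
log_flags = flag_outliers(skewed, log_scale=True)  # flags nothing
```

With these made-up values, the largest result is flagged on the original scale but not on the log scale, which is the behavior Jim describes for log-normally distributed data.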
This topic could potentially be related to the censored data method used for each characteristic (but feel free to move this to a new issue):
Example.....
Cristina: Is 1/x useful? Lesley Merrick (OR): They use it when the detection limit (or ½ detection limit) is above the water quality standard, particularly when using geomean. This is our white paper on using censored data in the IR: https://www.oregon.gov/deq/FilterDocs/iriCensoredData.pdf
This issue is related to the TADA Shiny issue and pending development of an outlier tab: https://github.com/USEPA/TADAShiny/issues/137
A few existing packages related to outliers:
@cristinamullin are there any notes from previous working group discussions that might be helpful for me to review on this topic?
@wokenny13 the EnvStats package might be useful to check out for some of the mod 3 functions.
Consider adding outlier information to TADA stats function.
Append one or two additional columns to the dataset flagging outliers at the individual station/char level and/or at the all stations/char level.
Add a new input to the stats function to flag outliers within a single station (input ID) or across all stations:
- Scale = AllStations
- Scale = IndividualStations
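A rough Python sketch of how the two Scale options could behave (standard library only). Everything here is hypothetical: the function name `flag_by_scale`, the record keys `station` and `value`, and the assumption that the records have already been filtered to one characteristic and unit. It is not the TADA stats function.

```python
import statistics
from collections import defaultdict

def _fences(values, k=1.5):
    """Return (lower, upper) fences, or None if there are too few values."""
    if len(values) < 2:
        return None  # not enough data to estimate quartiles
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_by_scale(records, scale="AllStations", k=1.5):
    """Return one outlier flag per record, in the same order as the input.

    scale="AllStations": one set of fences from all records together.
    scale="IndividualStations": separate fences for each station.
    """
    if scale not in ("AllStations", "IndividualStations"):
        raise ValueError("scale must be 'AllStations' or 'IndividualStations'")
    groups = defaultdict(list)
    for rec in records:
        key = rec["station"] if scale == "IndividualStations" else "all"
        groups[key].append(rec["value"])
    fences = {key: _fences(vals, k) for key, vals in groups.items()}
    flags = []
    for rec in records:
        key = rec["station"] if scale == "IndividualStations" else "all"
        f = fences[key]
        flags.append(f is not None and (rec["value"] < f[0] or rec["value"] > f[1]))
    return flags

# Made-up records for one characteristic/unit at two stations
records = (
    [{"station": "A", "value": v} for v in [1, 2, 3, 4, 50]]
    + [{"station": "B", "value": v} for v in [40, 45, 50, 55, 60]]
)
station_flags = flag_by_scale(records, scale="IndividualStations")
all_flags = flag_by_scale(records, scale="AllStations")
```

In this made-up example the value 50 at station A is flagged at the individual-station scale but not at the all-stations scale, which is why appending a separate flag column for each scale may be worth considering.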