AmberSJones / pyhydroqc

This software performs anomaly detection and correction for time series data from water sensors. It was developed using the Logan River Observatory data set.
BSD 3-Clause "New" or "Revised" License

Anomaly Detection and Correction for Aquatic Sensor Data

This repository contains software to identify and correct anomalous values in time series data collected by in situ aquatic sensors. The code was developed for application to data collected in the Logan River Observatory, sourced at http://lrodata.usu.edu/tsa/ or on HydroShare. All functions contained in the package are documented here. The package may be installed from the Python Package Index.

The package development, testing, and performance are reported in Jones, A.S., Jones, T.L., Horsburgh, J.S. (2022). Toward automated post processing of aquatic sensor data, Environmental Modelling and Software, https://doi.org/10.1016/j.envsoft.2022.105364

Methods currently implemented include ARIMA (AutoRegressive Integrated Moving Average) and LSTM (Long Short-Term Memory). These are time series regression methods that detect anomalies by comparing model estimates to sensor observations and labeling a point as anomalous when the difference between the two exceeds a threshold.
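
As a minimal sketch of the thresholding idea described above (not the package's own implementation), the following compares model estimates to observations and flags points whose absolute residual exceeds a threshold. The function name `flag_by_residual`, the sample values, and the use of a single constant threshold are assumptions for illustration only; the detection script described below determines dynamic thresholds instead.

```python
def flag_by_residual(observed, predicted, threshold):
    """Return one boolean per point: True where |observed - predicted| > threshold.

    Hypothetical helper for illustration; not part of the package.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return [abs(o - p) > threshold for o, p in zip(observed, predicted)]


# Made-up sensor observations and model estimates (e.g., water temperature).
observed = [10.1, 10.2, 14.9, 10.3, 10.2]
predicted = [10.0, 10.2, 10.3, 10.3, 10.1]

flags = flag_by_residual(observed, predicted, threshold=1.0)
print(flags)  # only the third point has a residual larger than 1.0
```

In practice the estimates would come from a fitted ARIMA or LSTM model rather than a hard-coded list.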

There are multiple possible approaches for applying LSTM for anomaly detection/correction.

Correction approaches depend on the method. For ARIMA, each group of consecutive anomalous points is considered as a unit to be corrected. Separate ARIMA models are developed for valid points preceding and following the anomalous group. Model estimates are blended to achieve a correction. For LSTM, correction may be based on a univariate or multivariate approach. Correction is made on a point-by-point basis where each point is considered a separate unit to be corrected. The developed model is used to estimate a correction to the first anomalous point in a group, which is then used as input to estimate the following anomalous point, and so on.
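
The two correction strategies can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's code: the linear cross-fade used to blend the two ARIMA estimates is an assumed weighting scheme, `predict_next` is a hypothetical stand-in for a trained LSTM model, and all function names and sample values are invented for the example.

```python
def blend_estimates(forecast, backcast):
    """Blend estimates from the model fit on preceding data (forecast)
    with estimates from the model fit on following data (backcast).

    Assumed scheme: a linear cross-fade, so early points lean on the
    forecast and late points lean on the backcast.
    """
    n = len(forecast)
    if n != len(backcast):
        raise ValueError("forecast and backcast must have the same length")
    blended = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight on the backcast grows across the gap
        blended.append((1 - w) * forecast[i] + w * backcast[i])
    return blended


def correct_recursively(values, is_anomalous, predict_next, window):
    """Correct anomalous points one at a time, feeding each corrected
    value back in as input for the next estimate."""
    corrected = list(values)
    for i, flagged in enumerate(is_anomalous):
        # Points without a full window of preceding values are left unchanged.
        if flagged and i >= window:
            corrected[i] = predict_next(corrected[i - window:i])
    return corrected


# ARIMA-style correction of a three-point anomalous group.
blended = blend_estimates([10.0, 10.0, 10.0], [14.0, 14.0, 14.0])
print(blended)

# LSTM-style correction; the mean of the window stands in for a real model.
def mean_predictor(recent):
    return sum(recent) / len(recent)

values = [1.0, 1.0, 9.0, 9.0, 1.0]
flags = [False, False, True, True, False]
corrected = correct_recursively(values, flags, mean_predictor, window=2)
print(corrected)
```

Note that in the recursive case the second anomalous point is estimated from a window that already contains the corrected first point, which is the behavior the paragraph above describes.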

Files are organized by method for anomaly detection and data correction. Utilities files contain functions, wrappers, and parameter definitions called by the other scripts. A typical workflow involves:

  1. Retrieving data
  2. Applying rules-based detection to screen data and apply initial corrections
  3. Developing a model (i.e., ARIMA or LSTM)
  4. Applying model to make time series predictions
  5. Determining a threshold and detecting anomalies by comparing sensor observations to modeled results
  6. Widening the window over which an anomaly is identified
  7. Comparing anomaly detections to data labeled by technicians (if available) and determining metrics
  8. Applying developed models to make corrections for anomalous events
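
Steps 6 and 7 above can be illustrated with a small sketch. It assumes a point-wise comparison and simple true positive, false positive, and false negative counts; the package may define widening and metrics differently. The function names `widen_flags` and `confusion_counts` and the sample data are hypothetical.

```python
def widen_flags(flags, n):
    """Extend every detected anomaly by n points before and after it."""
    widened = [False] * len(flags)
    for i, flagged in enumerate(flags):
        if flagged:
            start = max(0, i - n)
            stop = min(len(flags), i + n + 1)
            for j in range(start, stop):
                widened[j] = True
    return widened


def confusion_counts(detected, labeled):
    """Count point-wise agreement between detections and technician labels."""
    pairs = list(zip(detected, labeled))
    return {
        "tp": sum(d and l for d, l in pairs),      # detected and labeled
        "fp": sum(d and not l for d, l in pairs),  # detected but not labeled
        "fn": sum(not d and l for d, l in pairs),  # labeled but missed
    }


# A technician labeled three points; the model flagged only the middle one.
detected = [False, False, False, True, False, False, False, False]
labeled = [False, False, True, True, True, False, False, False]

before = confusion_counts(detected, labeled)
widened = widen_flags(detected, n=1)
after = confusion_counts(widened, labeled)

print(before)
print(after)
```

Widening trades some risk of extra false positives for better coverage of events that the model only partially detects.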

File Descriptions

detect.script.py

This script contains the code to apply anomaly detection methods to data from four sensors (water temperature, specific conductance, pH, dissolved oxygen) at six sites in the Logan River Observatory. The script calls functions to retrieve data, perform rules-based anomaly detection and correction, develop and get estimates from five models (ARIMA, LSTM univariate, LSTM univariate bidirectional, LSTM multivariate, and LSTM multivariate bidirectional), determine dynamic thresholds and detect anomalies, widen the window of detection and compare to raw data, and determine metrics. This application script refers to parameters stored in the parameters file.

parameters.py

This file contains assignments of parameters for all steps of the anomaly detection workflow. Parameters are defined for each site and sensor and are referenced in the detect script. LSTM parameters are consistent across sites and variables. ARIMA hyperparameters are specific to each site/sensor combination. Other parameters are used for rules-based anomaly detection, determining dynamic thresholds, and widening anomalous events.

anomaly_utilities.py

Contains functions for performing anomaly detection and correction:

modeling_utilities.py

Contains functions for building and training models:

rules_detect.py

Contains functions for rules-based anomaly detection and preprocessing. Depends on anomaly_utilities.py. Functions include:

calibration.py

Contains functions for identifying and correcting calibration events. Functions include:

model_workflow.py

Contains functionality to build and train ARIMA and LSTM models, apply the models to make predictions, set thresholds, detect anomalies, widen anomalous events, and determine metrics. Depends on anomaly_utilities.py, modeling_utilities.py, and rules_detect.py. Wrapper function names are: arima_detect, lstm_detect_univar, and lstm_detect_multivar. LSTM model workflows include options for vanilla or bidirectional. Within each wrapper function, the full detection workflow is followed. Options allow for output of plots, summaries, and metrics.

ARIMA_correct.py

Contains functionality to perform corrections and plot results using ARIMA models. Depends on anomaly_utilities.py.

Dependencies

This software depends on the following Python packages:

Sponsors and Credits

NSF-1931297

The material in this repository is based on work supported by National Science Foundation Grant 1931297. Any opinions, findings, and conclusions or recommendations expressed are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.