AICoE / prometheus-anomaly-detector

An updated version of the Prometheus anomaly detector (https://github.com/AICoE/prometheus-anomaly-detector-legacy)
GNU General Public License v3.0

Anomaly Detection in Prometheus Metrics

This repository contains the prototype for a Prometheus Anomaly Detector (PAD), which can be deployed on OpenShift. The PAD is a framework for deploying a metric prediction model to detect anomalies in Prometheus metrics.

Prometheus is the monitoring tool of choice across multiple products and platforms. Prometheus metrics are time series data identified by metric name and key/value pairs. As the volume of incoming metrics grows, it becomes harder to separate the signal from the noise. The current state of the art is to graph metrics on dashboards and alert on thresholds. This application leverages machine learning algorithms such as Fourier and Prophet models to perform time series forecasting and predict anomalous behavior in the metrics. The predicted values are compared with the actual values, and if they differ by more than the configured threshold, the data point is flagged as an anomaly.
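The comparison step described above can be sketched in a few lines. This is a hypothetical helper, not the project's actual code: a point is flagged as anomalous when the observed value falls outside the prediction band produced by the model (Prophet, for example, emits lower and upper uncertainty bounds alongside each forecast).

```python
def flag_anomalies(actual, yhat_lower, yhat_upper):
    """Return a list of booleans, True where the observed value falls
    outside the model's predicted uncertainty band."""
    return [
        not (lo <= value <= hi)
        for value, lo, hi in zip(actual, yhat_lower, yhat_upper)
    ]

# Toy data: the third observation (42.0) is far outside its band.
observed = [10.0, 11.0, 42.0, 9.5]
lower    = [8.0,  9.0,  10.0, 8.0]
upper    = [12.0, 13.0, 14.0, 11.0]

print(flag_anomalies(observed, lower, upper))  # → [False, False, True, False]
```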

Use Case

The use case for this framework is to assist teams with real-time alerting on their system/application metrics. Developers can also use the time series forecasts produced by the models to update and enhance their systems so that future anomalies are handled gracefully.

Configurations

Configuration is currently done via environment variables. The configuration options are defined in prometheus-anomaly-detector/configuration.py.
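For example, the environment might be prepared as below. The variable names and values shown are illustrative; check configuration.py for the exact names your version supports.

```shell
# Illustrative configuration (placeholder URL and values — verify the
# supported variable names in configuration.py before use).
export FLT_PROM_URL="http://prometheus.example.com:9090"   # Prometheus endpoint
export FLT_PROM_ACCESS_TOKEN="..."                         # token, if the endpoint requires auth
export FLT_METRICS_LIST="up{job='node-exporter'}"          # metrics to train and predict on
```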

Once the environment variables are set, you can run the application locally as:

python app.py

You can also use the Makefile to run the application:

make run_app

Using the pre-built Container Image
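A minimal sketch of pulling and running the pre-built image, assuming it is published at quay.io/aicoe/prometheus-anomaly-detector (verify the registry path and tag before use); configuration is passed as environment variables, as in the local setup:

```shell
# Registry path, tag, and variable names below are assumptions.
podman pull quay.io/aicoe/prometheus-anomaly-detector:latest
podman run -e FLT_PROM_URL="http://prometheus.example.com:9090" \
    quay.io/aicoe/prometheus-anomaly-detector:latest
```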

Implementation

The current setup is illustrated in the architecture diagrams from the "Thoth Dgraph anomaly detection" blog post.

Model Testing

For a given timeframe of a metric with known anomalies, the PAD can be run in test mode to check whether the models report those anomalies. The accuracy and performance of the models can then be logged as metrics to MLflow so that results can be compared.
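The accuracy metrics in question can be computed as below. This is a hypothetical helper, not the project's test_model.py; each value could then be recorded with mlflow.log_metric.

```python
import math

def accuracy_metrics(actual, predicted):
    """Return mean absolute error and root mean squared error for a
    test run; these would then be logged, e.g. via
    mlflow.log_metric("mae", mae) and mlflow.log_metric("rmse", rmse)."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

mae, rmse = accuracy_metrics([10.0, 12.0, 14.0], [11.0, 12.0, 12.0])
print(mae)  # → 1.0
```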

MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, and deployment: https://mlflow.org/

Test Configurations

Environment variables are loaded from .env, and pipenv loads these automatically. Make sure you install dependencies with pipenv install and execute everything through pipenv run.

Configuration is currently done via environment variables. The configuration options are defined in prometheus-anomaly-detector/test_configuration.py.

Once the environment variables are set, you can run the application locally as:

python test_model.py

You can also use the Makefile to run the application:

make run_test

You can now view the metrics being logged in your MLflow tracking server UI.
