This repository contains the prototype for a Prometheus Anomaly Detector (PAD), which can be deployed on OpenShift. The PAD is a framework for deploying a metric prediction model to detect anomalies in Prometheus metrics.
Prometheus is the chosen application for monitoring across multiple products and platforms. Prometheus metrics are time series data identified by a metric name and key/value pairs. As the volume of incoming metrics grows, it is getting harder to see the signal within the noise. The current state of the art is to graph metrics on dashboards and alert on thresholds. This application uses machine learning models such as Fourier and Prophet to perform time series forecasting and predict anomalous behavior in the metrics. The predicted values are compared with the actual values, and if the difference exceeds the default threshold values, the data point is flagged as an anomaly.
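The comparison described above can be sketched in a few lines of Python. This is only an illustration and not the detector's actual code: the function name flag_anomalies and the single absolute threshold parameter are assumptions made for the example.

```python
from typing import List, Sequence


def flag_anomalies(actual: Sequence[float],
                   predicted: Sequence[float],
                   threshold: float) -> List[bool]:
    """Return True for every point whose actual value deviates from the
    prediction by more than `threshold` (a hypothetical parameter)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return [abs(a - p) > threshold for a, p in zip(actual, predicted)]


actual = [1.0, 1.1, 5.0, 0.9]
predicted = [1.0, 1.0, 1.0, 1.0]
flags = flag_anomalies(actual, predicted, threshold=0.5)
print(flags)  # only the third point deviates by more than 0.5
```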
The use case for this framework is to assist teams in real-time alerting of their system/application metrics. The time series forecasting performed by the models can be used by developers to update/enhance their systems to tackle the anomalies in the future.
FLT_PROM_URL - URL of the Prometheus host from which the metric data will be collected.
FLT_PROM_ACCESS_TOKEN - OAuth token to be passed as a header to connect to the Prometheus host (optional).
FLT_METRICS_LIST - List of metrics that are collected from Prometheus and used to train the Prophet model. Example: "up{app='openshift-web-console', instance='172.44.0.18:8443'}; up{app='openshift-web-console', instance='172.44.4.18:8443'}; es_process_cpu_percent{instance='172.44.17.134:30290'}". Multiple metrics can be separated using a semicolon (;).
FLT_RETRAINING_INTERVAL_MINUTES - Frequency of model training, that is, how often the model is retrained (default: 15). Example: if this parameter is set to 15, the past 15 minutes of metric data are collected every 15 minutes and appended to the training dataframe.
FLT_ROLLING_TRAINING_WINDOW_SIZE - Limits the size of the training dataframe to prevent out-of-memory errors. It can be set to the duration of data that should be stored in memory as dataframes (default: 15d). Example: if this parameter is set to 1d, metric data older than 1 day is deleted from the training dataframe each time before the model is trained.
FLT_PARALLELISM - An option for parallelism. Each metric is "assigned" a separate model object. This parameter represents the number of models that will be trained concurrently. The default value is 1, and the upper limit depends on the number of CPU cores provided to the container.
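To make the interplay between the retraining interval and the rolling training window more concrete, here is a minimal sketch using only the Python standard library. It assumes the training data is a plain list of (timestamp, value) pairs rather than a dataframe, and the function name update_training_data is hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Sample = Tuple[datetime, float]  # (timestamp, metric value)


def update_training_data(training: List[Sample],
                         new_samples: List[Sample],
                         window: timedelta,
                         now: datetime) -> List[Sample]:
    """Append newly collected samples, then drop everything older than
    `now - window` so the training data stays bounded in memory."""
    cutoff = now - window
    combined = training + new_samples
    return [(ts, value) for ts, value in combined if ts >= cutoff]


now = datetime(2024, 1, 2, 12, 0)
training = [
    (now - timedelta(days=2), 1.0),   # older than the 1-day window, will be dropped
    (now - timedelta(hours=3), 1.0),  # still inside the window
]
new_samples = [(now - timedelta(minutes=5), 0.0)]  # from the latest 15-minute pull

training = update_training_data(training, new_samples, timedelta(days=1), now)
print(len(training))
```

With a window of one day, the two-day-old sample is discarded and the fresh sample is kept, which mirrors the 1d example in the list above.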
If you are testing locally, you can do the following:
Environment variables are loaded from .env. pipenv loads these automatically, so make sure you run everything through pipenv (pipenv run or pipenv shell) after installing the dependencies with pipenv install.
Configuration is currently done via environment variables. The configuration options are defined in prometheus-anomaly-detector/configuration.py.
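As an illustration of environment-based configuration, the sketch below reads the variables described above from a mapping. It is not the content of configuration.py: the function name load_config and the dictionary keys are assumptions, and the defaults for the retraining interval and window size are taken from the descriptions in this document, while the default of 1 for parallelism is assumed.

```python
import os
from typing import Dict, List, Mapping, Optional


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, object]:
    """Read the FLT_* settings from a mapping of environment variables."""
    raw_metrics = env.get("FLT_METRICS_LIST", "")
    # Metrics are separated by semicolons; strip whitespace and skip empty entries.
    metrics: List[str] = [m.strip() for m in raw_metrics.split(";") if m.strip()]
    token: Optional[str] = env.get("FLT_PROM_ACCESS_TOKEN")  # optional setting
    return {
        "prom_url": env["FLT_PROM_URL"],  # required; raises KeyError if unset
        "access_token": token,
        "metrics": metrics,
        "retraining_interval_minutes": int(env.get("FLT_RETRAINING_INTERVAL_MINUTES", "15")),
        "rolling_training_window_size": env.get("FLT_ROLLING_TRAINING_WINDOW_SIZE", "15d"),
        "parallelism": int(env.get("FLT_PARALLELISM", "1")),  # assumed default
    }


config = load_config({
    "FLT_PROM_URL": "http://demo.robustperception.io:9090",
    "FLT_METRICS_LIST": "up{app='openshift-web-console'}; es_process_cpu_percent",
})
print(config["metrics"])
```

Passing a dictionary instead of os.environ keeps the example independent of the shell it runs in.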
Once the environment variables are set, you can run the application locally as:
python app.py
You can also use the Makefile to run the application:
make run_app
A container image is available at quay.io/aicoe/prometheus-anomaly-detector:latest. It can be run with Docker as follows:
docker run --name pad -p 8080:8080 --network host \
--env FLT_PROM_URL=http://demo.robustperception.io:9090 \
--env FLT_RETRAINING_INTERVAL_MINUTES=15 \
--env FLT_METRICS_LIST='up' \
--env APP_FILE=app.py \
--env FLT_DATA_START_TIME=3d \
--env FLT_ROLLING_TRAINING_WINDOW_SIZE=15d \
quay.io/aicoe/prometheus-anomaly-detector:latest
To remove the container once it has stopped:
docker rm pad
The current setup is as follows:
yhat - Predicted time series value
yhat_lower - Lower bound of the uncertainty interval
yhat_upper - Upper bound of the uncertainty interval

For a given timeframe of a metric with known anomalies, the PAD can be run in test-mode to check whether the models report these anomalies. The accuracy and performance of the models can then be logged as metrics to MLflow to compare the results.
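A test run of this kind could be evaluated roughly as follows. The sketch assumes that a point counts as anomalous when its actual value falls outside the interval from yhat_lower to yhat_upper, and it uses precision and recall as example scores; the document does not state which accuracy metrics the PAD logs, so both choices, along with the function names, are assumptions.

```python
from typing import Dict, List, Sequence


def flag_outside_interval(actual: Sequence[float],
                          yhat_lower: Sequence[float],
                          yhat_upper: Sequence[float]) -> List[bool]:
    """Flag points whose actual value lies outside [yhat_lower, yhat_upper]."""
    return [not (lo <= a <= hi) for a, lo, hi in zip(actual, yhat_lower, yhat_upper)]


def score(predicted: Sequence[bool], known: Sequence[bool]) -> Dict[str, float]:
    """Compute precision and recall of predicted flags against known anomalies."""
    tp = sum(p and k for p, k in zip(predicted, known))      # true positives
    fp = sum(p and not k for p, k in zip(predicted, known))  # false positives
    fn = sum(k and not p for p, k in zip(predicted, known))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}


actual = [1.0, 1.2, 4.0, 0.8, 3.5]
yhat_lower = [0.5] * 5
yhat_upper = [1.5] * 5
known = [False, False, True, True, False]  # labelled anomalies for this timeframe

flags = flag_outside_interval(actual, yhat_lower, yhat_upper)
result = score(flags, known)
print(flags, result)
```

In this toy data one known anomaly is detected, one is missed, and one normal point is flagged, so both scores come out at 0.5.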
MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility and deployment. It currently offers three components, which are described at https://mlflow.org/
FLT_PROM_URL - URL of the Prometheus host from which the metric data will be collected.
FLT_PROM_ACCESS_TOKEN - OAuth token to be passed as a header to connect to the Prometheus host (optional).
FLT_METRICS_LIST - List of metrics that are collected from Prometheus and used to train the Prophet model. Example: "up{app='openshift-web-console', instance='172.44.0.18:8443'}; up{app='openshift-web-console', instance='172.44.4.18:8443'}; es_process_cpu_percent{instance='172.44.17.134:30290'}". Multiple metrics can be separated using a semicolon (;).
FLT_RETRAINING_INTERVAL_MINUTES - Frequency of model training, that is, how often the model is retrained (default: 15). Example: if this parameter is set to 15, the past 15 minutes of metric data are collected every 15 minutes and appended to the training dataframe.
FLT_ROLLING_TRAINING_WINDOW_SIZE - Limits the size of the training dataframe to prevent out-of-memory errors. It can be set to the duration of data that should be stored in memory as dataframes (default: 15d). Example: if this parameter is set to 1d, metric data older than 1 day is deleted from the training dataframe each time before the model is trained.
MLFLOW_TRACKING_URI - URI of the MLflow tracking server.
FLT_TRUE_ANOMALY_THRESHOLD - Threshold value used to calculate true anomalies using a linear function.
FLT_DATA_START_TIME - Start time of your metric data timeframe window.
FLT_DATA_END_TIME - End time of your metric data timeframe window.

Environment variables are loaded from .env. pipenv loads these automatically, so make sure you run everything through pipenv (pipenv run or pipenv shell) after installing the dependencies with pipenv install.
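The document shows duration-style values such as 3d for FLT_DATA_START_TIME and 15d for FLT_ROLLING_TRAINING_WINDOW_SIZE, but it does not specify which formats are accepted. The sketch below shows one possible way to turn such values into a concrete time window. It assumes that a value like 3d means "3 days before now" and that only the suffixes m, h and d are used; the helper names parse_duration and data_window are hypothetical.

```python
import re
from datetime import datetime, timedelta
from typing import Tuple

# Assumed unit suffixes; the document only shows values such as "3d" and "15d".
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Convert a string such as '15d' or '90m' into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def data_window(start: str, end: str, now: datetime) -> Tuple[datetime, datetime]:
    """Return (start_time, end_time) for durations given relative to `now`."""
    return now - parse_duration(start), now - parse_duration(end)


now = datetime(2024, 1, 10, 0, 0)
start_time, end_time = data_window("3d", "1h", now)
print(start_time, end_time)
```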
Configuration is currently done via environment variables. The configuration options are defined in prometheus-anomaly-detector/test_configuration.py.
Once the environment variables are set, you can run the application locally as:
python test_model.py
You can also use the Makefile to run the application:
make run_test
You can now view the logged metrics in the UI of your MLflow tracking server.