AICoE / prometheus-anomaly-detector

A newer, more up-to-date version of the Prometheus anomaly detector (https://github.com/AICoE/prometheus-anomaly-detector-legacy)
GNU General Public License v3.0

Problems integrating the image (unknown error) #168

Closed: Cesuuur closed this issue 2 years ago

Cesuuur commented 2 years ago

Hello, I'm having trouble integrating the image into my platform (Kubernetes). I have run a lot of tests but haven't been able to draw any conclusions. I'm fairly sure the error has to do with the labeling of my metrics.

At first I had the same error that appears in this issue, 500: Internal Server Error, which is caused by this check in prometheus_client:

        if labelkwargs:
            if sorted(labelkwargs) != sorted(self._labelnames):
                raise ValueError('Incorrect label names')
            labelvalues = tuple(str(labelkwargs[l]) for l in self._labelnames)

At initialization it stores a set of label names, and on every new request it checks that the labels passed in match that set. So you need to pass all of the metric's labels even if you don't want to constrain them. I have done that and tested with a couple of metrics, and each one gave a different result. (I wanted to run app.py locally, but there was no way; I tried several virtual environments and Python 3.8, but I ran into this issue: Getting AttributeError: Can't pickle local object 'BaseAsyncIOLoop.initialize..assign_thread_identity' error.)
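
For context, a minimal standalone sketch of that behaviour (metric and label names are made up): declaring a Gauge with one set of label names and then calling .labels() with a different set is exactly what raises the error.

from prometheus_client import Gauge

# Hypothetical gauge declared with two label names
g = Gauge("demo_metric", "Demo gauge", ["app", "instance"])

# Fine: every declared label is supplied
g.labels(app="kong", instance="pod-1").set(1.0)

# Raises ValueError('Incorrect label names'): the supplied label names
# do not match the ones declared above
g.labels(app="kong").set(1.0)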

Here is the configuration:


            - name: FLT_PROM_URL
              value: http://thanos-querier:10902
            - name: FLT_RETRAINING_INTERVAL_MINUTES
              value: "15"
            - name: FLT_METRICS_LIST
              value: 'kong_latency_sum{type="kong",app=~".*",channel_id=~".*",client_id=~".*",environment=~".*",instance=~".*",job=~".*",kubernetes_namespace=~".*",kubernetes_pod_name=~".*",pod_template_hash=~".*",service=~".*",stack=~".*",tenant=~".*"}'
            - name: FLT_ROLLING_TRAINING_WINDOW_SIZE
              value: "1d"
            - name: FLT_PARALLELISM
              value: "3"
            - name: FLT_DEBUG_MODE
              value: "True"

And here is the result I get:

---> Running application from Python script (app.py) ...
2021-11-29 17:02:04,955:INFO:configuration: Metric data rolling training window size: 14 days, 23:59:59.944930
2021-11-29 17:02:04,955:INFO:configuration: Model retraining interval: 15 minutes
2021-11-29 17:02:04,991:ERROR:prophet.plot: Importing plotly failed. Interactive plots will not work.
2021-11-29 17:02:04,995:DEBUG:urllib3.connectionpool: Starting new HTTP connection (1): thanos-querier:10902
2021-11-29 17:02:05,001:DEBUG:urllib3.connectionpool: http://thanos-querier:10902 "GET /api/v1/query?query=kong_latency_sum%7Btype%3D%22kong%22%2Capp%3D~%22.%2A%22%2Cchannel_id%3D~%22.%2A%22%2Cclient_id%3D~%22.%2A%22%2Cenvironment%3D~%22.%2A%22%2Cinstance%3D~%22.%2A%22%2Cjob%3D~%22.%2A%22%2Ckubernetes_namespace%3D~%22.%2A%22%2Ckubernetes_pod_name%3D~%22.%2A%22%2Cpod_template_hash%3D~%22.%2A%22%2Cservice%3D~%22.%2A%22%2Cstack%3D~%22.%2A%22%2Ctenant%3D~%22.%2A%22%7D HTTP/1.1" 200 64
2021-11-29 17:02:05,003:INFO:__main__: Training models using ProcessPool of size:3
2021-11-29 17:02:05,016:INFO:__main__: Initializing Tornado Web App
2021-11-29 17:02:05,017:DEBUG:asyncio: Using selector: EpollSelector
2021-11-29 17:02:05,020:INFO:__main__: Will retrain model every 15 minutes
2021-11-29 17:02:11,399:INFO:tornado.access: 200 GET /metrics (10.240.1.142) 3.20ms
2021-11-29 17:02:41,390:INFO:tornado.access: 200 GET /metrics (10.240.1.142) 1.16ms
2021-11-29 17:03:11,389:INFO:tornado.access: 200 GET /metrics (10.240.1.142) 0.95ms

(These logs are from testing with a 15-day rolling window and a 15-minute retraining interval.) I have tried to manually do the HTTP GET ... /metrics that Prometheus would do, and I can verify that it does not expose the forecasts, only metadata like:

...
...
...
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.1
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 18.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
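
(For reference, that manual check is just a plain scrape of the app's /metrics endpoint, the same request Prometheus makes on each scrape; a rough sketch, with a placeholder URL:)

import requests

# Hypothetical manual scrape; the URL is a placeholder for wherever the
# detector's Tornado app is reachable from inside the cluster.
METRICS_URL = "http://prometheus-anomaly-detector:8080/metrics"

body = requests.get(METRICS_URL, timeout=5).text
# Once a model has been trained and published, the forecast gauge should show
# up with value_type labels such as yhat / yhat_lower / yhat_upper.
forecast_lines = [line for line in body.splitlines() if "yhat" in line]
print("\n".join(forecast_lines) or "no forecast samples exposed yet")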

It seems that the model is never trained. With the other metric, the opposite happens: the model is trained, but the Tornado server never seems to get initialized.

value: 'up{aadpodidbinding=~".*",app=~".*",environment=~".*",instance=~".*",job=~".*",kubernetes_namespace=~".*",kubernetes_pod_name=~".*",pod_template_hash=~".*",stack=~".*"}'

The up metric is a trivial one, since it is just 1 or 0, but I was only trying to see what was going wrong.

2021-11-29 14:54:20,871:DEBUG:urllib3.connectionpool: Starting new HTTP connection (1): thanos-querier:10902
2021-11-29 14:54:20,879:DEBUG:urllib3.connectionpool: http://thanos-querier:10902 "GET /api/v1/query?query=up%7Bapp%3D%27prometheus%27%2Ccontroller_revision_hash%3D%27prometheus-d49bd6b%27%2Cenvironment%3D%27cluster-shared-azure%27%2Cinstance%3D%2710.3.0.210%3A9090%27%2Cjob%3D%27kubernetes-pods%27%2Ckubernetes_namespace%3D%27baikal-state%27%2Ckubernetes_pod_name%3D%27prometheus-0%27%2Cstack%3D%27management%27%2Cstatefulset_kubernetes_io_pod_name%3D%27prometheus-0%27%7D%5B86400s%5D&time=1638197661 HTTP/1.1" 200 None
2021-11-29 14:54:20,977:INFO:model: training data range: 2021-11-29 09:36:08.943000064 - 2021-11-29 14:53:38.943000064
2021-11-29 14:54:20,977:DEBUG:model: begin training
2021-11-29 14:54:21,941:DEBUG:model:                                yhat  yhat_lower  yhat_upper
timestamp
2021-11-29 14:54:38.943000064   1.0         1.0         1.0
2021-11-29 14:54:21,945:INFO:__main__: Total Training time taken = 0:00:01.064886, for metric: up {'app': 'prometheus', 'controller_revision_hash': 'prometheus-d49bd6b', 'environment': 'cluster-shared-azure', 'instance': '10.3.0.210:9090', 'job': 'kubernetes-pods', 'kubernetes_namespace': 'baikal-state', 'kubernetes_pod_name': 'prometheus-0', 'stack': 'management', 'statefulset_kubernetes_io_pod_name': 'prometheus-0'}
2021-11-29 14:54:21,945:DEBUG:prometheus_api_client.prometheus_connect: start_time: 2021-11-28 14:54:22.003457
2021-11-29 14:54:21,945:DEBUG:prometheus_api_client.prometheus_connect: end_time: 2021-11-29 14:54:21.945876
2021-11-29 14:54:21,946:DEBUG:prometheus_api_client.prometheus_connect: chunk_size: None
2021-11-29 14:54:21,946:DEBUG:prometheus_api_client.prometheus_connect: Prometheus Query: up{app='prometheus-es-exporter',environment='cluster-shared-azure',instance='10.240.0.75:8080',job='kubernetes-service-endpoints',kubernetes_name='prometheus-es-exporter',kubernetes_namespace='baikal-system'}
2021-11-29 14:54:21,948:DEBUG:urllib3.connectionpool: Starting new HTTP connection (1): thanos-querier:10902
2021-11-29 14:54:21,956:DEBUG:urllib3.connectionpool: http://thanos-querier:10902 "GET /api/v1/query?query=up%7Bapp%3D%27prometheus-es-exporter%27%2Cenvironment%3D%27cluster-shared-azure%27%2Cinstance%3D%2710.240.0.75%3A8080%27%2Cjob%3D%27kubernetes-service-endpoints%27%2Ckubernetes_name%3D%27prometheus-es-exporter%27%2Ckubernetes_namespace%3D%27baikal-system%27%7D%5B86400s%5D&time=1638197662 HTTP/1.1" 200 1903

As I said before, here it trains the model all the time, but it never starts the Tornado server (these logs are with the configuration given above).

Thank you very much in advance. And by the way, quite a good project :)

Cesuuur commented 2 years ago

In case anyone ever comes back here ... I have found an error that has to do with labels. In app.py:

# A gauge set for the predicted values
GAUGE_DICT = dict()
for predictor in PREDICTOR_MODEL_LIST:
    unique_metric = predictor.metric
    label_list = list(unique_metric.label_config.keys())
    label_list.append("value_type")
    if unique_metric.metric_name not in GAUGE_DICT:
        GAUGE_DICT[unique_metric.metric_name] = Gauge(
            unique_metric.metric_name + "_" + predictor.model_name,
            predictor.model_description,
            label_list,
        )

Here you initialize the Gauge and pass it a label list, but there is an error: you create one Gauge per metric name, yet you only store the label set of the first series you see. This ends in the 500: Internal Server Error I mentioned above when Prometheus does its HTTP GET.

# Check for all the columns available in the prediction
# and publish the values for each of them
for column_name in list(prediction.columns):
    GAUGE_DICT[metric_name].labels(
        **predictor_model.metric.label_config, value_type=column_name
    ).set(prediction[column_name][0])
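
A stripped-down sketch of how that plays out (the label sets here are made up): the Gauge keeps only the label names of the first series, so publishing another series of the same metric with extra labels trips the check quoted earlier, and the scrape handler answers with the 500.

from prometheus_client import Gauge

# Two hypothetical series of the same metric with different label sets
series_a = {"app": "kong", "service": "gateway"}
series_b = {"app": "kong", "service": "gateway", "tenant": "acme"}

# The Gauge only gets the label names of the first series it sees ...
gauge = Gauge("kong_latency_sum_prophet", "forecast", list(series_a) + ["value_type"])

gauge.labels(**series_a, value_type="yhat").set(1.0)  # fine
gauge.labels(**series_b, value_type="yhat").set(1.0)  # ValueError: Incorrect label names -> 500 on scrape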

It's easy to solve; I've written a function that collects the labels from all the series of a given metric:

def all_labels(unique_metric, label_list):
    # global GAUGE_DICT
    # If it is a new metric, initialize the list
    if unique_metric.metric_name not in GAUGE_DICT:
        label_list = list(unique_metric.label_config.keys())
        label_list.append("value_type")
    # Otherwise (new series, but same metric), walk the whole label set
    # and add any labels we don't already have stored
    else:
        for label in list(unique_metric.label_config.keys()):
            if label not in label_list:
                label_list.append(label)
    return label_list

....
....
....

# A gauge set for the predicted values
GAUGE_DICT = dict()
label_list = list()
for predictor in PREDICTOR_MODEL_LIST:
    unique_metric = predictor.metric
    label_list = all_labels(unique_metric, label_list)
    if unique_metric.metric_name not in GAUGE_DICT:
        GAUGE_DICT[unique_metric.metric_name] = Gauge(
            unique_metric.metric_name + "_" + predictor.model_name,
            predictor.model_description,
            label_list,
        )
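
One follow-up note on that fix, as a hedged sketch rather than something tested against the repo: prometheus_client still requires every declared label name on each .labels() call, so series that lack some of the unioned labels would need those labels padded (for example with empty strings) when publishing.

# Hypothetical companion to the fix above: keep the declared label names per
# metric and pad any label a given series does not carry with an empty string,
# so every .labels() call matches the label names the Gauge was created with.
LABEL_NAMES = {}  # metric_name -> list of label names passed to Gauge(...)

def padded_labels(metric_name, label_config):
    return {name: label_config.get(name, "")
            for name in LABEL_NAMES[metric_name]
            if name != "value_type"}

# ... when creating the Gauge:
#     LABEL_NAMES[unique_metric.metric_name] = label_list
# ... when publishing, instead of **predictor_model.metric.label_config:
#     GAUGE_DICT[metric_name].labels(
#         **padded_labels(metric_name, predictor_model.metric.label_config),
#         value_type=column_name,
#     ).set(prediction[column_name][0])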
sesheta commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

sesheta commented 2 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

sesheta commented 2 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

sesheta commented 2 years ago

@sesheta: Closing this issue.

In response to [this](https://github.com/AICoE/prometheus-anomaly-detector/issues/168#issuecomment-1121539326):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.