monitoring_prometheus-anomaly.1.6k3k3cj7uv8k@docker01 | 2020-09-10 09:04:31,136:ERROR:tornado.application: Uncaught exception GET /metrics (10.0.1.7)
monitoring_prometheus-anomaly.1.6k3k3cj7uv8k@docker01 | HTTPServerRequest(protocol='http', host='10.0.1.12:8080', method='GET', uri='/metrics', version='HTTP/1.1', remote_ip='10.0.1.7')
monitoring_prometheus-anomaly.1.6k3k3cj7uv8k@docker01 | Traceback (most recent call last):
monitoring_prometheus-anomaly.1.6k3k3cj7uv8k@docker01 | File "/opt/conda/envs/prophet-env/lib/python3.6/site-packages/tornado/web.py", line 1703, in _execute
monitoring_prometheus-anomaly.1.6k3k3cj7uv8k@docker01 | result = await result
monitoring_prometheus-anomaly.1.6k3k3cj7uv8k@docker01 | File "app.py", line 71, in get
monitoring_prometheus-anomaly.1.6k3k3cj7uv8k@docker01 | for predictor_model in self.settings["model_list"]:
monitoring_prometheus-anomaly.1.6k3k3cj7uv8k@docker01 | KeyError: 'model_list'
monitoring_prometheus-anomaly.1.6k3k3cj7uv8k@docker01 | 2020-09-10 09:04:31,137:ERROR:tornado.access: 500 GET /metrics (10.0.1.7) 1.40ms
500: Internal Server Error
Could you please enable debug mode and show me the logs?
You can do this by setting the environment variable FLT_DEBUG_MODE=True.
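For example, since the log prefixes suggest it is running as the swarm service monitoring_prometheus-anomaly, something along these lines should add the variable without editing the stack file (just a suggestion; adapt the service name if yours differs):

# add the debug flag to the running swarm service (service name taken from the log prefix)
docker service update --env-add FLT_DEBUG_MODE=True monitoring_prometheus-anomaly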
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | Matplotlib created a temporary config/cache directory at /tmp/matplotlib-o1oorkdn because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,149:INFO:configuration: Metric data rolling training window size: 123 days, 15:23:27.134343
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,150:INFO:configuration: Model retraining interval: 10 minutes
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,342:ERROR:fbprophet.plot: Importing plotly failed. Interactive plots will not work.
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,350:DEBUG:urllib3.connectionpool: Starting new HTTP connection (1): prometheus:9090
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,356:DEBUG:urllib3.connectionpool: http://prometheus:9090 "GET /api/v1/query?query=up%7Bjob%3D%22prometheus%22%7D HTTP/1.1" 200 160
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,376:DEBUG:prometheus_api_client.prometheus_connect: start_time: 2020-05-10 00:00:00.241852
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,377:DEBUG:prometheus_api_client.prometheus_connect: end_time: 2020-09-10 15:23:27.376202
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,377:DEBUG:prometheus_api_client.prometheus_connect: chunk_size: None
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,377:DEBUG:prometheus_api_client.prometheus_connect: Prometheus Query: up{instance='localhost:9090',job='prometheus'}
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,388:DEBUG:urllib3.connectionpool: http://prometheus:9090 "GET /api/v1/query?query=up%7Binstance%3D%27localhost%3A9090%27%2Cjob%3D%27prometheus%27%7D%5B10682607s%5D&time=1599751407 HTTP/1.1" 200 None
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:28,490:INFO:model: training data range: 2020-09-09 12:00:13.226999998 - 2020-09-10 15:23:27.885999918
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:28,490:DEBUG:model: begin training
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:30,244:DEBUG:model: yhat yhat_lower yhat_upper
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | timestamp
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:24:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:25:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:26:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:27:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:28:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:29:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:30:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:31:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:32:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:33:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:30,251:INFO:__main__: Total Training time taken = 0:00:02.857312, for metric: up {'instance': 'localhost:9090', 'job': 'prometheus'}
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:30,252:INFO:__main__: Initializing Tornado Web App
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:30,257:DEBUG:asyncio: Using selector: EpollSelector
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | Traceback (most recent call last):
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "/opt/conda/envs/prophet-env/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | obj = _ForkingPickler.dumps(obj)
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "/opt/conda/envs/prophet-env/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | cls(buf, protocol).dump(obj)
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | TypeError: can't pickle _thread.RLock objects
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:30,262:INFO:__main__: Will retrain model every 10 minutes
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:43,269:ERROR:tornado.application: Uncaught exception GET / (10.0.0.2)
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | HTTPServerRequest(protocol='http', host='172.28.200.111:8080', method='GET', uri='/', version='HTTP/1.1', remote_ip='10.0.0.2')
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | Traceback (most recent call last):
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "/opt/conda/envs/prophet-env/lib/python3.6/site-packages/tornado/web.py", line 1703, in _execute
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | result = await result
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "app.py", line 71, in get
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | for predictor_model in self.settings["model_list"]:
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | KeyError: 'model_list'
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:43,273:ERROR:tornado.access: 500 GET / (10.0.0.2) 6.77ms
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:43,287:WARNING:tornado.access: 404 GET /favicon.ico (10.0.0.2) 0.76ms
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:55,672:ERROR:tornado.application: Uncaught exception GET /metrics (10.0.1.7)
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | HTTPServerRequest(protocol='http', host='10.0.1.20:8080', method='GET', uri='/metrics', version='HTTP/1.1', remote_ip='10.0.1.7')
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | Traceback (most recent call last):
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "/opt/conda/envs/prophet-env/lib/python3.6/site-packages/tornado/web.py", line 1703, in _execute
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | result = await result
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "app.py", line 71, in get
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | for predictor_model in self.settings["model_list"]:
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | KeyError: 'model_list'
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:55,674:ERROR:tornado.access: 500 GET /metrics (10.0.1.7) 2.27ms
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:24:01,666:ERROR:tornado.application: Uncaught exception GET / (10.0.0.2)
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | HTTPServerRequest(protocol='http', host='172.28.200.111:8080', method='GET', uri='/', version='HTTP/1.1', remote_ip='10.0.0.2')
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | Traceback (most recent call last):
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "/opt/conda/envs/prophet-env/lib/python3.6/site-packages/tornado/web.py", line 1703, in _execute
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | result = await result
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "app.py", line 71, in get
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | for predictor_model in self.settings["model_list"]:
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | KeyError: 'model_list'
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:24:01,667:ERROR:tornado.access: 500 GET / (10.0.0.2) 1.70ms
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:24:10,668:ERROR:tornado.application: Uncaught exception GET /metrics (10.0.1.7)
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | HTTPServerRequest(protocol='http', host='10.0.1.20:8080', method='GET', uri='/metrics', version='HTTP/1.1', remote_ip='10.0.1.7')
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | Traceback (most recent call last):
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "/opt/conda/envs/prophet-env/lib/python3.6/site-packages/tornado/web.py", line 1703, in _execute
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | result = await result
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "app.py", line 71, in get
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | for predictor_model in self.settings["model_list"]:
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | KeyError: 'model_list'
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:24:10,669:ERROR:tornado.access: 500 GET /metrics (10.0.1.7) 1.09ms
Also, could you please tell me how you are building the container, and share your PAD configuration so I can replicate it?
prometheus-anomaly:
  image: prom/prometheus-anomaly-detector:latest
  networks:
    - monitoring
  ports:
    - "8080:8080"
  environment:
    - FLT_PROM_URL=http://prometheus:9090
    - FLT_METRICS_LIST=up{job="prometheus"}
    - FLT_DEBUG_MODE=True
    # - FLT_METRICS_LIST=container_memory_rss{job="cadvisor"}
    - FLT_RETRAINING_INTERVAL_MINUTES=10
    - FLT_ROLLING_TRAINING_WINDOW_SIZE=5
    - APP_FILE=app.py
  deploy:
    mode: replicated
    replicas: 1
    placement:
      constraints:
        - node.hostname == docker01
    resources:
      limits:
        memory: 2048M
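I deploy it as a swarm stack, roughly like this (the compose file name here is illustrative; the stack name monitoring matches the service prefix in the logs):

# run on the swarm manager; file name is an assumption, stack name matches the log prefix
docker stack deploy -c docker-compose.yml monitoring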
Could you change the rolling window size to FLT_ROLLING_TRAINING_WINDOW_SIZE=5d? That said, it should not be the cause of this issue.
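i.e. the environment entry in your compose file would become (just the one-line change, so the window is parsed as 5 days):

    - FLT_ROLLING_TRAINING_WINDOW_SIZE=5d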
I tried replicating your configuration and it works fine for me
I changed it, but I still get the same issue.
It was working fine for me, but after I changed the following:
FLT_METRICS_LIST=container_memory_rss{job="cadvisor"}
and then removed the service/container and deployed it again, it stopped working.
This is my log:
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-u5nkvj7l because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2020-09-10 16:09:21,543:INFO:configuration: Metric data rolling training window size: 4 days, 23:59:59.940549
2020-09-10 16:09:21,544:INFO:configuration: Model retraining interval: 1 minutes
2020-09-10 16:09:21,704:ERROR:fbprophet.plot: Importing plotly failed. Interactive plots will not work.
2020-09-10 16:09:24,021:INFO:model: training data range: 2020-09-05 16:09:30.276000023 - 2020-09-10 16:09:21.862999916
2020-09-10 16:09:26,050:INFO:__main__: Total Training time taken = 0:00:03.709155, for metric: up {'instance': 'demo.robustperception.io:9090', 'job': 'prometheus'}
2020-09-10 16:09:26,960:INFO:model: training data range: 2020-09-05 16:09:30.697000027 - 2020-09-10 16:09:21.862999916
2020-09-10 16:09:29,031:INFO:__main__: Total Training time taken = 0:00:02.558205, for metric: up {'instance': 'demo.robustperception.io:9091', 'job': 'pushgateway'}
2020-09-10 16:09:29,943:INFO:model: training data range: 2020-09-05 16:09:36.108999968 - 2020-09-10 16:09:26.111999989
2020-09-10 16:09:31,987:INFO:__main__: Total Training time taken = 0:00:02.537416, for metric: up {'instance': 'demo.robustperception.io:9093', 'job': 'alertmanager'}
2020-09-10 16:09:32,916:INFO:model: training data range: 2020-09-05 16:09:34.710999966 - 2020-09-10 16:09:24.710000038
2020-09-10 16:09:35,010:INFO:__main__: Total Training time taken = 0:00:02.590763, for metric: up {'instance': 'demo.robustperception.io:9100', 'job': 'node'}
2020-09-10 16:09:35,013:INFO:__main__: Initializing Tornado Web App
2020-09-10 16:09:35,063:INFO:__main__: Will retrain model every 1 minutes
2020-09-10 16:09:53,662:INFO:tornado.access: 200 GET / (172.17.0.1) 465.96ms
2020-09-10 16:09:53,736:WARNING:tornado.access: 404 GET /favicon.ico (172.17.0.1) 0.53ms
2020-09-10 16:10:00,029:INFO:tornado.access: 200 GET / (172.17.0.1) 394.14ms
2020-09-10 16:10:35,122:INFO:schedule: Running job Every 1 minute do train_model(initial_run=False, data_queue=<multiprocessing.queues.Queue object at 0x7f035ffc75c0>) (last run: [never], next run: 2020-09-10 16:10:35)
2020-09-10 16:10:35,236:INFO:model: training data range: 2020-09-05 16:10:40.269000053 - 2020-09-10 16:10:30.267999887
2020-09-10 16:10:37,255:INFO:__main__: Total Training time taken = 0:00:02.043853, for metric: up {'instance': 'demo.robustperception.io:9090', 'job': 'prometheus'}
2020-09-10 16:10:37,361:INFO:model: training data range: 2020-09-05 16:10:40.696000099 - 2020-09-10 16:10:30.707000017
2020-09-10 16:10:39,451:INFO:__main__: Total Training time taken = 0:00:02.110926, for metric: up {'instance': 'demo.robustperception.io:9091', 'job': 'pushgateway'}
2020-09-10 16:10:39,560:INFO:model: training data range: 2020-09-05 16:10:46.114000082 - 2020-09-10 16:10:36.114000082
2020-09-10 16:10:41,570:INFO:__main__: Total Training time taken = 0:00:02.030247, for metric: up {'instance': 'demo.robustperception.io:9093', 'job': 'alertmanager'}
2020-09-10 16:10:41,676:INFO:model: training data range: 2020-09-05 16:10:44.710000038 - 2020-09-10 16:10:34.710000038
2020-09-10 16:10:43,747:INFO:__main__: Total Training time taken = 0:00:02.091811, for metric: up {'instance': 'demo.robustperception.io:9100', 'job': 'node'}
2020-09-10 16:11:43,807:INFO:schedule: Running job Every 1 minute do train_model(initial_run=False, data_queue=<multiprocessing.queues.Queue object at 0x7f035ffc75c0>) (last run: 2020-09-10 16:10:43, next run: 2020-09-10 16:11:43)
2020-09-10 16:11:43,914:INFO:model: training data range: 2020-09-05 16:11:50.272000074 - 2020-09-10 16:11:40.272000074
2020-09-10 16:11:45,918:INFO:__main__: Total Training time taken = 0:00:02.025308, for metric: up {'instance': 'demo.robustperception.io:9090', 'job': 'prometheus'}
2020-09-10 16:11:46,024:INFO:model: training data range: 2020-09-05 16:11:50.714999914 - 2020-09-10 16:11:40.697000027
2020-09-10 16:11:48,093:INFO:__main__: Total Training time taken = 0:00:02.089482, for metric: up {'instance': 'demo.robustperception.io:9091', 'job': 'pushgateway'}
2020-09-10 16:11:48,198:INFO:model: training data range: 2020-09-05 16:11:56.128000020 - 2020-09-10 16:11:46.108000040
2020-09-10 16:11:50,272:INFO:__main__: Total Training time taken = 0:00:02.092415, for metric: up {'instance': 'demo.robustperception.io:9093', 'job': 'alertmanager'}
2020-09-10 16:11:50,377:INFO:model: training data range: 2020-09-05 16:11:54.723000050 - 2020-09-10 16:11:44.709000111
2020-09-10 16:11:51,183:INFO:tornado.access: 200 GET / (172.17.0.1) 411.94ms
2020-09-10 16:11:52,506:INFO:__main__: Total Training time taken = 0:00:02.148522, for metric: up {'instance': 'demo.robustperception.io:9100', 'job': 'node'}
2020-09-10 16:12:52,567:INFO:schedule: Running job Every 1 minute do train_model(initial_run=False, data_queue=<multiprocessing.queues.Queue object at 0x7f035ffc75c0>) (last run: 2020-09-10 16:11:52, next run: 2020-09-10 16:12:52)
2020-09-10 16:12:52,673:INFO:model: training data range: 2020-09-05 16:13:00.269000053 - 2020-09-10 16:12:50.269000053
2020-09-10 16:12:54,723:INFO:__main__: Total Training time taken = 0:00:02.068827, for metric: up {'instance': 'demo.robustperception.io:9090', 'job': 'prometheus'}
2020-09-10 16:12:54,830:INFO:model: training data range: 2020-09-05 16:13:00.697000027 - 2020-09-10 16:12:50.713000059
2020-09-10 16:12:56,862:INFO:__main__: Total Training time taken = 0:00:02.052248, for metric: up {'instance': 'demo.robustperception.io:9091', 'job': 'pushgateway'}
2020-09-10 16:12:56,970:INFO:model: training data range: 2020-09-05 16:13:06.105000019 - 2020-09-10 16:12:56.109999895
2020-09-10 16:12:59,055:INFO:__main__: Total Training time taken = 0:00:02.104857, for metric: up {'instance': 'demo.robustperception.io:9093', 'job': 'alertmanager'}
2020-09-10 16:12:59,162:INFO:model: training data range: 2020-09-05 16:13:04.710000038 - 2020-09-10 16:12:54.710000038
2020-09-10 16:13:01,202:INFO:__main__: Total Training time taken = 0:00:02.060389, for metric: up {'instance': 'demo.robustperception.io:9100', 'job': 'node'}
I think the issue was with some older dependency versions. I am building a new container image which you can use instead.
I just tested the new image using this command:
docker run --name pad -p 127.0.0.1:8080:8080 \
  --env APP_FILE=app.py \
  --env FLT_PROM_URL=http://demo.robustperception.io:9090 \
  --env FLT_RETRAINING_INTERVAL_MINUTES=1 \
  --env FLT_METRICS_LIST=up \
  quay.io/4n4nd/prometheus-anomaly-detector:latest
and it worked for me
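Once the initial training finishes, you can do a quick sanity check against the exposed endpoint (using the same port mapping as in the command above):

# /metrics is the endpoint Prometheus would scrape; / serves the web view
curl http://127.0.0.1:8080/metrics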
Thank you, it is working now. Do you know if there are any ready-made Grafana dashboards, or any hints on how to visualize the metrics?
Not really, we don't have any ready dashboards.
@mahmoud-mahdi do you mind if I close this issue?
I'll close this issue; if you need any help, please feel free to open a new one.