monitoring_prometheus-anomaly.1.6k3k3cj7uv8k@docker01 | 2020-09-10 09:04:31,136:ERROR:tornado.application: Uncaught exception GET /metrics (10.0.1.7)
monitoring_prometheus-anomaly.1.6k3k3cj7uv8k@docker01 | HTTPServerRequest(protocol='http', host='10.0.1.12:8080', method='GET', uri='/metrics', version='HTTP/1.1', remote_ip='10.0.1.7')
monitoring_prometheus-anomaly.1.6k3k3cj7uv8k@docker01 | Traceback (most recent call last):
monitoring_prometheus-anomaly.1.6k3k3cj7uv8k@docker01 | File "/opt/conda/envs/prophet-env/lib/python3.6/site-packages/tornado/web.py", line 1703, in _execute
monitoring_prometheus-anomaly.1.6k3k3cj7uv8k@docker01 | result = await result
monitoring_prometheus-anomaly.1.6k3k3cj7uv8k@docker01 | File "app.py", line 71, in get
monitoring_prometheus-anomaly.1.6k3k3cj7uv8k@docker01 | for predictor_model in self.settings["model_list"]:
monitoring_prometheus-anomaly.1.6k3k3cj7uv8k@docker01 | KeyError: 'model_list'
monitoring_prometheus-anomaly.1.6k3k3cj7uv8k@docker01 | 2020-09-10 09:04:31,137:ERROR:tornado.access: 500 GET /metrics (10.0.1.7) 1.40ms
500: Internal Server Error
Could you please enable debug mode and show me the logs?
You can do this by setting the environment variable FLT_DEBUG_MODE=True.
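For example, since the log prefixes suggest it is running as the swarm service monitoring_prometheus-anomaly, something along these lines should add the variable without editing the stack file (just a suggestion; adapt the service name if yours differs):

# add the debug flag to the running swarm service (service name taken from the log prefix)
docker service update --env-add FLT_DEBUG_MODE=True monitoring_prometheus-anomaly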
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | Matplotlib created a temporary config/cache directory at /tmp/matplotlib-o1oorkdn because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,149:INFO:configuration: Metric data rolling training window size: 123 days, 15:23:27.134343
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,150:INFO:configuration: Model retraining interval: 10 minutes
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,342:ERROR:fbprophet.plot: Importing plotly failed. Interactive plots will not work.
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,350:DEBUG:urllib3.connectionpool: Starting new HTTP connection (1): prometheus:9090
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,356:DEBUG:urllib3.connectionpool: http://prometheus:9090 "GET /api/v1/query?query=up%7Bjob%3D%22prometheus%22%7D HTTP/1.1" 200 160
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,376:DEBUG:prometheus_api_client.prometheus_connect: start_time: 2020-05-10 00:00:00.241852
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,377:DEBUG:prometheus_api_client.prometheus_connect: end_time: 2020-09-10 15:23:27.376202
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,377:DEBUG:prometheus_api_client.prometheus_connect: chunk_size: None
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,377:DEBUG:prometheus_api_client.prometheus_connect: Prometheus Query: up{instance='localhost:9090',job='prometheus'}
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:27,388:DEBUG:urllib3.connectionpool: http://prometheus:9090 "GET /api/v1/query?query=up%7Binstance%3D%27localhost%3A9090%27%2Cjob%3D%27prometheus%27%7D%5B10682607s%5D&time=1599751407 HTTP/1.1" 200 None
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:28,490:INFO:model: training data range: 2020-09-09 12:00:13.226999998 - 2020-09-10 15:23:27.885999918
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:28,490:DEBUG:model: begin training
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:30,244:DEBUG:model: yhat yhat_lower yhat_upper
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | timestamp
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:24:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:25:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:26:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:27:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:28:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:29:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:30:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:31:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:32:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:33:27.885999918 1.0 1.0 1.0
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:30,251:INFO:__main__: Total Training time taken = 0:00:02.857312, for metric: up {'instance': 'localhost:9090', 'job': 'prometheus'}
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:30,252:INFO:__main__: Initializing Tornado Web App
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:30,257:DEBUG:asyncio: Using selector: EpollSelector
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | Traceback (most recent call last):
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "/opt/conda/envs/prophet-env/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | obj = _ForkingPickler.dumps(obj)
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "/opt/conda/envs/prophet-env/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | cls(buf, protocol).dump(obj)
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | TypeError: can't pickle _thread.RLock objects
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:30,262:INFO:__main__: Will retrain model every 10 minutes
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:43,269:ERROR:tornado.application: Uncaught exception GET / (10.0.0.2)
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | HTTPServerRequest(protocol='http', host='172.28.200.111:8080', method='GET', uri='/', version='HTTP/1.1', remote_ip='10.0.0.2')
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | Traceback (most recent call last):
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "/opt/conda/envs/prophet-env/lib/python3.6/site-packages/tornado/web.py", line 1703, in _execute
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | result = await result
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "app.py", line 71, in get
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | for predictor_model in self.settings["model_list"]:
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | KeyError: 'model_list'
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:43,273:ERROR:tornado.access: 500 GET / (10.0.0.2) 6.77ms
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:43,287:WARNING:tornado.access: 404 GET /favicon.ico (10.0.0.2) 0.76ms
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:55,672:ERROR:tornado.application: Uncaught exception GET /metrics (10.0.1.7)
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | HTTPServerRequest(protocol='http', host='10.0.1.20:8080', method='GET', uri='/metrics', version='HTTP/1.1', remote_ip='10.0.1.7')
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | Traceback (most recent call last):
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "/opt/conda/envs/prophet-env/lib/python3.6/site-packages/tornado/web.py", line 1703, in _execute
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | result = await result
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "app.py", line 71, in get
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | for predictor_model in self.settings["model_list"]:
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | KeyError: 'model_list'
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:23:55,674:ERROR:tornado.access: 500 GET /metrics (10.0.1.7) 2.27ms
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:24:01,666:ERROR:tornado.application: Uncaught exception GET / (10.0.0.2)
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | HTTPServerRequest(protocol='http', host='172.28.200.111:8080', method='GET', uri='/', version='HTTP/1.1', remote_ip='10.0.0.2')
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | Traceback (most recent call last):
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "/opt/conda/envs/prophet-env/lib/python3.6/site-packages/tornado/web.py", line 1703, in _execute
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | result = await result
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "app.py", line 71, in get
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | for predictor_model in self.settings["model_list"]:
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | KeyError: 'model_list'
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:24:01,667:ERROR:tornado.access: 500 GET / (10.0.0.2) 1.70ms
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:24:10,668:ERROR:tornado.application: Uncaught exception GET /metrics (10.0.1.7)
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | HTTPServerRequest(protocol='http', host='10.0.1.20:8080', method='GET', uri='/metrics', version='HTTP/1.1', remote_ip='10.0.1.7')
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | Traceback (most recent call last):
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "/opt/conda/envs/prophet-env/lib/python3.6/site-packages/tornado/web.py", line 1703, in _execute
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | result = await result
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | File "app.py", line 71, in get
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | for predictor_model in self.settings["model_list"]:
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | KeyError: 'model_list'
monitoring_prometheus-anomaly.1.batx0fu2g6yj@docker01 | 2020-09-10 15:24:10,669:ERROR:tornado.access: 500 GET /metrics (10.0.1.7) 1.09ms
Also, could you please tell me how you are building the container, and share your PAD configuration so I can replicate it?
prometheus-anomaly:
  image: prom/prometheus-anomaly-detector:latest
  networks:
    - monitoring
  ports:
    - "8080:8080"
  environment:
    - FLT_PROM_URL=http://prometheus:9090
    - FLT_METRICS_LIST=up{job="prometheus"}
    - FLT_DEBUG_MODE=True
    # - FLT_METRICS_LIST=container_memory_rss{job="cadvisor"}
    - FLT_RETRAINING_INTERVAL_MINUTES=10
    - FLT_ROLLING_TRAINING_WINDOW_SIZE=5
    - APP_FILE=app.py
  deploy:
    mode: replicated
    replicas: 1
    placement:
      constraints:
        - node.hostname == docker01
    resources:
      limits:
        memory: 2048M
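I deploy it as a swarm stack, roughly like this (the compose file name here is illustrative; the stack name monitoring matches the service prefix in the logs):

# run on the swarm manager; file name is an assumption, stack name matches the log prefix
docker stack deploy -c docker-compose.yml monitoring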
Could you change the rolling window size to FLT_ROLLING_TRAINING_WINDOW_SIZE=5d? That said, it should not be the cause of this issue.
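i.e. the environment entry in your compose file would become (just the one-line change, so the window is parsed as 5 days):

    - FLT_ROLLING_TRAINING_WINDOW_SIZE=5d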
I tried replicating your configuration and it works fine for me
I changed it, but I still get the same issue.
It was working fine for me, but after I changed the following:
FLT_METRICS_LIST=container_memory_rss{job="cadvisor"}
and then removed the service/container and deployed it again, it stopped working.
This is my log:
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-u5nkvj7l because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2020-09-10 16:09:21,543:INFO:configuration: Metric data rolling training window size: 4 days, 23:59:59.940549
2020-09-10 16:09:21,544:INFO:configuration: Model retraining interval: 1 minutes
2020-09-10 16:09:21,704:ERROR:fbprophet.plot: Importing plotly failed. Interactive plots will not work.
2020-09-10 16:09:24,021:INFO:model: training data range: 2020-09-05 16:09:30.276000023 - 2020-09-10 16:09:21.862999916
2020-09-10 16:09:26,050:INFO:__main__: Total Training time taken = 0:00:03.709155, for metric: up {'instance': 'demo.robustperception.io:9090', 'job': 'prometheus'}
2020-09-10 16:09:26,960:INFO:model: training data range: 2020-09-05 16:09:30.697000027 - 2020-09-10 16:09:21.862999916
2020-09-10 16:09:29,031:INFO:__main__: Total Training time taken = 0:00:02.558205, for metric: up {'instance': 'demo.robustperception.io:9091', 'job': 'pushgateway'}
2020-09-10 16:09:29,943:INFO:model: training data range: 2020-09-05 16:09:36.108999968 - 2020-09-10 16:09:26.111999989
2020-09-10 16:09:31,987:INFO:__main__: Total Training time taken = 0:00:02.537416, for metric: up {'instance': 'demo.robustperception.io:9093', 'job': 'alertmanager'}
2020-09-10 16:09:32,916:INFO:model: training data range: 2020-09-05 16:09:34.710999966 - 2020-09-10 16:09:24.710000038
2020-09-10 16:09:35,010:INFO:__main__: Total Training time taken = 0:00:02.590763, for metric: up {'instance': 'demo.robustperception.io:9100', 'job': 'node'}
2020-09-10 16:09:35,013:INFO:__main__: Initializing Tornado Web App
2020-09-10 16:09:35,063:INFO:__main__: Will retrain model every 1 minutes
2020-09-10 16:09:53,662:INFO:tornado.access: 200 GET / (172.17.0.1) 465.96ms
2020-09-10 16:09:53,736:WARNING:tornado.access: 404 GET /favicon.ico (172.17.0.1) 0.53ms
2020-09-10 16:10:00,029:INFO:tornado.access: 200 GET / (172.17.0.1) 394.14ms
2020-09-10 16:10:35,122:INFO:schedule: Running job Every 1 minute do train_model(initial_run=False, data_queue=<multiprocessing.queues.Queue object at 0x7f035ffc75c0>) (last run: [never], next run: 2020-09-10 16:10:35)
2020-09-10 16:10:35,236:INFO:model: training data range: 2020-09-05 16:10:40.269000053 - 2020-09-10 16:10:30.267999887
2020-09-10 16:10:37,255:INFO:__main__: Total Training time taken = 0:00:02.043853, for metric: up {'instance': 'demo.robustperception.io:9090', 'job': 'prometheus'}
2020-09-10 16:10:37,361:INFO:model: training data range: 2020-09-05 16:10:40.696000099 - 2020-09-10 16:10:30.707000017
2020-09-10 16:10:39,451:INFO:__main__: Total Training time taken = 0:00:02.110926, for metric: up {'instance': 'demo.robustperception.io:9091', 'job': 'pushgateway'}
2020-09-10 16:10:39,560:INFO:model: training data range: 2020-09-05 16:10:46.114000082 - 2020-09-10 16:10:36.114000082
2020-09-10 16:10:41,570:INFO:__main__: Total Training time taken = 0:00:02.030247, for metric: up {'instance': 'demo.robustperception.io:9093', 'job': 'alertmanager'}
2020-09-10 16:10:41,676:INFO:model: training data range: 2020-09-05 16:10:44.710000038 - 2020-09-10 16:10:34.710000038
2020-09-10 16:10:43,747:INFO:__main__: Total Training time taken = 0:00:02.091811, for metric: up {'instance': 'demo.robustperception.io:9100', 'job': 'node'}
2020-09-10 16:11:43,807:INFO:schedule: Running job Every 1 minute do train_model(initial_run=False, data_queue=<multiprocessing.queues.Queue object at 0x7f035ffc75c0>) (last run: 2020-09-10 16:10:43, next run: 2020-09-10 16:11:43)
2020-09-10 16:11:43,914:INFO:model: training data range: 2020-09-05 16:11:50.272000074 - 2020-09-10 16:11:40.272000074
2020-09-10 16:11:45,918:INFO:__main__: Total Training time taken = 0:00:02.025308, for metric: up {'instance': 'demo.robustperception.io:9090', 'job': 'prometheus'}
2020-09-10 16:11:46,024:INFO:model: training data range: 2020-09-05 16:11:50.714999914 - 2020-09-10 16:11:40.697000027
2020-09-10 16:11:48,093:INFO:__main__: Total Training time taken = 0:00:02.089482, for metric: up {'instance': 'demo.robustperception.io:9091', 'job': 'pushgateway'}
2020-09-10 16:11:48,198:INFO:model: training data range: 2020-09-05 16:11:56.128000020 - 2020-09-10 16:11:46.108000040
2020-09-10 16:11:50,272:INFO:__main__: Total Training time taken = 0:00:02.092415, for metric: up {'instance': 'demo.robustperception.io:9093', 'job': 'alertmanager'}
2020-09-10 16:11:50,377:INFO:model: training data range: 2020-09-05 16:11:54.723000050 - 2020-09-10 16:11:44.709000111
2020-09-10 16:11:51,183:INFO:tornado.access: 200 GET / (172.17.0.1) 411.94ms
2020-09-10 16:11:52,506:INFO:__main__: Total Training time taken = 0:00:02.148522, for metric: up {'instance': 'demo.robustperception.io:9100', 'job': 'node'}
2020-09-10 16:12:52,567:INFO:schedule: Running job Every 1 minute do train_model(initial_run=False, data_queue=<multiprocessing.queues.Queue object at 0x7f035ffc75c0>) (last run: 2020-09-10 16:11:52, next run: 2020-09-10 16:12:52)
2020-09-10 16:12:52,673:INFO:model: training data range: 2020-09-05 16:13:00.269000053 - 2020-09-10 16:12:50.269000053
2020-09-10 16:12:54,723:INFO:__main__: Total Training time taken = 0:00:02.068827, for metric: up {'instance': 'demo.robustperception.io:9090', 'job': 'prometheus'}
2020-09-10 16:12:54,830:INFO:model: training data range: 2020-09-05 16:13:00.697000027 - 2020-09-10 16:12:50.713000059
2020-09-10 16:12:56,862:INFO:__main__: Total Training time taken = 0:00:02.052248, for metric: up {'instance': 'demo.robustperception.io:9091', 'job': 'pushgateway'}
2020-09-10 16:12:56,970:INFO:model: training data range: 2020-09-05 16:13:06.105000019 - 2020-09-10 16:12:56.109999895
2020-09-10 16:12:59,055:INFO:__main__: Total Training time taken = 0:00:02.104857, for metric: up {'instance': 'demo.robustperception.io:9093', 'job': 'alertmanager'}
2020-09-10 16:12:59,162:INFO:model: training data range: 2020-09-05 16:13:04.710000038 - 2020-09-10 16:12:54.710000038
2020-09-10 16:13:01,202:INFO:__main__: Total Training time taken = 0:00:02.060389, for metric: up {'instance': 'demo.robustperception.io:9100', 'job': 'node'}
I think the issue was with some older dependency versions. I am building a new container image which you can use instead.
I just tested the new image using this command:
docker run --name pad -p 127.0.0.1:8080:8080 \
  --env APP_FILE=app.py \
  --env FLT_PROM_URL=http://demo.robustperception.io:9090 \
  --env FLT_RETRAINING_INTERVAL_MINUTES=1 \
  --env FLT_METRICS_LIST=up \
  quay.io/4n4nd/prometheus-anomaly-detector:latest
and it worked for me
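Once the initial training finishes, you can do a quick sanity check against the exposed endpoint (using the same port mapping as in the command above):

# /metrics is the endpoint Prometheus would scrape; / serves the web view
curl http://127.0.0.1:8080/metrics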
Thank you, it is working now. Do you know if there are any ready-made Grafana dashboards, or any hints on how to visualize the metrics?
Not really, we don't have any ready dashboards.
@mahmoud-mahdi do you mind if I close this issue?
I'll close this issue; if you need any help, please feel free to open a new one.