AICoE / prometheus-anomaly-detector

A newer, updated version of the prometheus anomaly detector (legacy version: https://github.com/AICoE/prometheus-anomaly-detector-legacy)
GNU General Public License v3.0

Not able to see prediction metrics on Prometheus graph #110

Closed · devoprock closed this issue 4 years ago

devoprock commented 4 years ago

I have followed the document below: I set up Prophet on an EC2 machine that pulls metrics (FLT_METRICS_LIST) from the prometheus endpoint (FLT_PROM_URL), but somehow I don't see any results for the prediction metrics from the Prophet prediction.

I keep seeing the output below on the Prophet machine. Can someone please help me with this?

"Convergence detected: relative gradient magnitude is below tolerance 2020-02-27 05:05:29,945:INFO:main: Total Training time taken = 0:00:13.435543, for metric: container_cpu:container_cpu_usage_seconds_total:rate:sum {} 2020-02-27 05:07:30,086:INFO:schedule: Running job Every 2 minutes do train_model(initial_run=False, data_queue=<multiprocessing.queues.Queue object at 0x7fed042683d0>) (last run: 2020-02-27 05:05:29, next run: 2020-02-27 05:07:29) 2020-02-27 05:07:30,158:INFO:model: training data range: 2020-02-16 18:36:32.441999912 - 2020-02-27 05:06:32.441999912 Initial log joint probability = -624.556 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 99 14102.9 0.0341506 242.469 1 1 116 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 199 14111.8 0.00151911 266.794 1 1 244 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 299 14116.3 0.00037893 126.022 0.6299 0.6299 358 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 399 14118.4 0.000192582 86.9727 0.3648 1 479 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 499 14122.9 0.00185006 157.717 1 1 593 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 587 14124.5 0.000250693 238.923 2.704e-06 0.001 742 LS failed, Hessian reset 599 14124.6 6.07051e-05 83.8182 0.7625 0.7625 755 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 645 14124.6 5.48574e-07 86.1275 0.3767 1 812 Optimization terminated normally: Convergence detected: relative gradient magnitude is below tolerance 2020-02-27 05:07:39,472:INFO:main: Total Training time taken = 0:00:09.362668, for metric: container_cpu:container_cpu_usage_seconds_total:rate:sum {}"

4n4nd commented 4 years ago

In your Prometheus targets list, can you check if it was successfully able to scrape the anomaly-detector?

devoprock commented 4 years ago

@4n4nd Thanks Anand Sanmukhani. In the Prometheus targets list dashboard I don't see any anomaly-detector entry.

When I run the python3 app.py script I see yhat, yhat_lower, and yhat_upper values, but I am not seeing the same thing in Prometheus. Please help with this issue.

" 2020-02-27 17:55:53,751:DEBUG:prometheus_api_client.prometheus_connect: start_time: 2020-02-27 17:53:53.754119 2020-02-27 17:55:53,751:DEBUG:prometheus_api_client.prometheus_connect: end_time: 2020-02-27 17:55:53.751601 2020-02-27 17:55:53,751:DEBUG:prometheus_api_client.prometheus_connect: chunk_size: None 2020-02-27 17:55:53,751:DEBUG:prometheus_api_client.prometheus_connect: Prometheus Query: container_cpu:container_cpu_usage_seconds_total:rate:sum 2020-02-27 17:55:53,753:DEBUG:urllib3.connectionpool: Starting new HTTP connection (1): ace5a9030446111eaa51406535e498bf-a39d9303f84e17f8.elb.eu-west-1.amazonaws.com:80 2020-02-27 17:55:53,779:DEBUG:urllib3.connectionpool: http://xxxxxxx.amazonaws.com:80 "GET /api/v1/query?query=container_cpu%3Acontainer_cpu_usage_seconds_total%3Arate%3Asum%5B120s%5D&time=1582826154 HTTP/1.1" 200 197 2020-02-27 17:55:53,828:INFO:model: training data range: 2020-02-25 17:56:32.441999912 - 2020-02-27 17:55:32.441999912 2020-02-27 17:55:53,828:DEBUG:model: begin training Initial log joint probability = -159.531 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 99 2672.44 0.00662214 113.375 1 1 125 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 172 2680.56 0.00135966 109.968 1.218e-05 0.001 242 LS failed, Hessian reset 199 2680.92 0.00272163 111.415 1 1 272 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 208 2681.75 0.00213477 402.471 1.513e-05 0.001 359 LS failed, Hessian reset 278 2683.21 0.000620649 158.737 4.925e-06 0.001 483 LS failed, Hessian reset 299 2683.33 2.07698e-05 97.2539 0.5944 0.5944 509 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 329 2683.34 5.97714e-08 98.7893 0.6591 0.6591 549 Optimization terminated normally: Convergence detected: relative gradient magnitude is below tolerance 2020-02-27 17:55:56,653:DEBUG:model: yhat yhat_lower yhat_upper timestamp 2020-02-27 17:56:32.441999912 11.045056 -4.791490 27.309229 2020-02-27 17:57:32.441999912 11.037078 -4.108421 26.203213 "

4n4nd commented 4 years ago

Can you please visit the metrics page in your browser, i.e. http://prom_anomaly_detector_address:8080/metrics, and check if you can see the predicted metrics? If you can see the predicted metrics there, can you check whether your prometheus instance is configured to scrape the anomaly detector properly? You might need to add the anomaly detector's address to the prometheus config; some helpful docs are available here.

If none of this works, could you please show me your prometheus configuration?

devoprock commented 4 years ago

Thanks @4n4nd for helping me with this issue.

1) http://172.xx.x.xx:8080/metrics --> I don't see any predicted metrics on the page.
2) Can you please help me with how to add the anomaly detector's address to the prometheus config? Sorry, I was not able to do it using the attached document https://prometheus.io/docs/prometheus/latest/getting_started/#configuring-prometheus-to-monitor-the-sample-targets.
3) We are running prometheus on an EKS Kubernetes cluster. Below is the prometheus configuration from the prometheus dashboard.


global:
  scrape_interval: 1m
  scrape_timeout: 10s
  evaluation_interval: 1m
alerting:
  alertmanagers:

4n4nd commented 4 years ago

We recently added new annotations to the Service object (here). Can you please try using the updated template for the Anomaly detector service?
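
For reference, Prometheus-style scrape annotations on a Service usually look something like this (a rough sketch; the exact annotation keys and values are in the linked template):

metadata:
  name: prometheus-anomaly-detector
  annotations:
    # conventional annotations used by a Prometheus kubernetes_sd/relabel config
    # to decide which Services to scrape and on which port and path
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"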

This might not work if Prometheus has been deployed in a different namespace than the prometheus anomaly detector. If that's the case you might need to add another job in your Prometheus config, something like:

- job_name: prometheus-anomaly-detector
  honor_labels: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
    - targets: ['route_to_anomaly_detector']
      labels:
        group: 'anomaly-detection'

4n4nd commented 4 years ago

1) http://172.xx.x.xx:8080/metrics --> I don't see any predicted metrics on the page.

Can you check if you have the prometheus-anomaly-detector service exposed? If not, you can do it with kubectl expose deployment anomaly_detector_deployment_name --port=8080 -n namespace_where_deployed
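
If you would rather apply a manifest than use kubectl expose, a minimal Service could look roughly like this (name, labels, and namespace are placeholders; the selector has to match your pod labels):

apiVersion: v1
kind: Service
metadata:
  name: prometheus-anomaly-detector
  namespace: namespace_where_deployed
spec:
  selector:
    app: prometheus-anomaly-detector   # must match the labels on the anomaly detector pods
  ports:
    - name: http
      port: 8080
      targetPort: 8080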

devoprock commented 4 years ago

@4n4nd, sorry to ask, but what host should we put as the target value in - targets: ['route_to_anomaly_detector'] in the Prometheus configuration on Kubernetes?

4n4nd commented 4 years ago

@nandhyala Is the Prometheus hosted in the same namespace as the anomaly detector?

devoprock commented 4 years ago

@4n4nd ,

I just want to explain what I am doing here so that we are both on the same page.

1) My prometheus/grafana are running on a Kubernetes cluster and collecting application metrics.
2) For the prometheus application I created an ALB endpoint on the cluster.
3) On an AWS EC2 machine I installed the RPMs required for Prophet.
4) Downloaded the repo https://github.com/AICoE/prometheus-anomaly-detector.git on the EC2 machine.
5) Exported the env variables below and executed the python app.py script on the EC2 machine.
6) The issue is that I don't understand how I can see the prediction graphs on the prometheus dashboard for the python app.py results.

export FLT_PROM_URL=http://axxxxxxxx.elb.eu-west-1.amazonaws.com
export FLT_RETRAINING_INTERVAL_MINUTES=5
export FLT_ROLLING_TRAINING_WINDOW_SIZE=1d
export FLT_METRICS_LIST="container_cpu:container_cpu_usage_seconds_total:rate:sum"

Can you please review the steps above and correct me if I am doing anything wrong here.

4n4nd commented 4 years ago

This application starts a web server and exposes metrics on port 8080. After step 5, I think you need to create an ALB endpoint for the anomaly detection application, just like for prometheus. Then you might get a URL similar to http://axxxxxxxx.elb.eu-west-1.amazonaws.com which you can use in the prometheus config. Sorry, I don't have much experience with AWS EC2. If you can, use the deployment configurations in the openshift dir to deploy the anomaly detector in your kubernetes cluster.

devoprock commented 4 years ago

@4n4nd, that means I need to create a load balancer for the EC2 machine where the app.py script is running?

4n4nd commented 4 years ago

Yes, because prometheus needs some kind of address that it can use to reach the anomaly detector.

devoprock commented 4 years ago

Got it @4n4nd, I will create an ALB for the EC2 machine.

4n4nd commented 4 years ago

You can test if the app is working by curling it from inside the EC2 instance:

curl localhost:8080/metrics from inside the EC2 instance

devoprock commented 4 years ago

[root@ip-xxxx~]# curl localhost:8080

# HELP container_cpu:container_cpu_usage_seconds_total:rate:sum_prophet Forecasted value from Prophet model
# TYPE container_cpu:container_cpu_usage_seconds_total:rate:sum_prophet gauge
container_cpu:container_cpu_usage_seconds_total:rate:sum_prophet{value_type="yhat"} -11.264086645780477
container_cpu:container_cpu_usage_seconds_total:rate:sum_prophet{value_type="yhat_lower"} -19.96119988783143
container_cpu:container_cpu_usage_seconds_total:rate:sum_prophet{value_type="yhat_upper"} -2.6234446516765177
container_cpu:container_cpu_usage_seconds_total:rate:sum_prophet{value_type="anomaly"} 1.0
[root@ip-xxxx ~]#

4n4nd commented 4 years ago

Exactly, these are your predicted values. But Prometheus needs to be able to access this page from the outside.

devoprock commented 4 years ago

One more question on the prediction values: I see only yhat, yhat_lower, and yhat_upper, but I don't see a predicted metrics value. In the repo's example graph I see a predicted metrics value line.

4n4nd commented 4 years ago

yhat is the predicted value.
yhat_upper is the upper bound.
yhat_lower is the lower bound.
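
Once Prometheus is scraping these, you can also use the anomaly flag the detector exposes (see the curl output above). A rough alerting-rule sketch, where the rule name, duration, and labels are arbitrary:

groups:
  - name: prophet-anomalies
    rules:
      - alert: ContainerCpuOutsidePredictedRange
        # value_type="anomaly" is the detector's own flag; it reports 1 when the
        # observed value is judged anomalous relative to the predicted band
        expr: container_cpu:container_cpu_usage_seconds_total:rate:sum_prophet{value_type="anomaly"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Observed CPU usage is outside the Prophet predicted range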

devoprock commented 4 years ago

Thank you very much @4n4nd for helping me through each step. I will create an ALB, place the new ALB in the Prometheus config file, and test. If I still face any issues I will update here.

4n4nd commented 4 years ago

Once you create the endpoint, check with curl that it is accessible.

devoprock commented 4 years ago

Sure @4n4nd

devoprock commented 4 years ago

I think I don't need an ALB for the POC server, because I am able to reach the anomaly detector metrics at the EC2 http://host-ip:8080 from my Prometheus pods. I will just place http://host-ip:8080 in the prometheus config and test.
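
In that case the static_configs from the job posted earlier would simply point at the EC2 host, something like (host-ip is a placeholder):

  static_configs:
    - targets: ['host-ip:8080']
      labels:
        group: 'anomaly-detection'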

4n4nd commented 4 years ago

good idea :+1:

devoprock commented 4 years ago

@4n4nd, now I am able to see the prophet yhat, yhat_lower, and yhat_upper metrics on the Prometheus dashboard. But I don't have enough data to do the training yet. Once I have data for the cluster application by next week, I will run it and check the prediction.

Thank you once again for your help here. I may reach out to you if I need any help with the prediction on my data :)

4n4nd commented 4 years ago

No problem 👍 Do you want to close this issue then?

devoprock commented 4 years ago

yes @4n4nd we can close this issue

Carmezim commented 3 years ago

@4n4nd I've got the anomaly detector running on K8s in the same namespace as the prometheus instance and am able to get a successful response calling wget localhost:8080/metrics from within the anomaly detector, but the prometheus instance cannot read from it and gets connection refused.

# anomaly detector instance

--2020-12-21 03:57:27--  http://localhost:8080/metrics
Resolving localhost (localhost)... 127.0.0.1, ::1
Connecting to localhost (localhost)|127.0.0.1|:8080... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1895 (1.9K) [text]
metrics: Is a directory

The anomaly detector is on:

spec:
  ports:
    - name: http
      protocol: TCP
      port: 8080
      targetPort: 8080
  clusterIP: ********
  type: ClusterIP

Wondering if there might be anything else I'm missing. Thank you.

4n4nd commented 3 years ago

@Carmezim I think this might be an issue with the way PAD is deployed in your k8s env. Can you post your deployment manifest? I think for some reason port 8080 for your pod is not being forwarded.
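
For comparison, the relevant part of a working Deployment looks roughly like this (names and image are placeholders); the pod labels must match the Service selector, and the container must serve metrics on the port the Service's targetPort points at:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-anomaly-detector
spec:
  selector:
    matchLabels:
      app: prometheus-anomaly-detector
  template:
    metadata:
      labels:
        app: prometheus-anomaly-detector   # must match the Service's spec.selector
    spec:
      containers:
        - name: prometheus-anomaly-detector
          image: anomaly-detector-image:latest   # placeholder image reference
          ports:
            - containerPort: 8080               # the port PAD serves /metrics on
              protocol: TCP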