In your Prometheus targets list, can you check if it was successfully able to scrape the anomaly-detector?
@4n4nd Thanks Anand Sanmukhani. In the Prometheus targets list dashboard I don't see any anomaly-detector entry.
When I run the python3 app.py script I see yhat, yhat_lower, and yhat_upper values, but I am not seeing the same in Prometheus. Please help with this issue.
" 2020-02-27 17:55:53,751:DEBUG:prometheus_api_client.prometheus_connect: start_time: 2020-02-27 17:53:53.754119 2020-02-27 17:55:53,751:DEBUG:prometheus_api_client.prometheus_connect: end_time: 2020-02-27 17:55:53.751601 2020-02-27 17:55:53,751:DEBUG:prometheus_api_client.prometheus_connect: chunk_size: None 2020-02-27 17:55:53,751:DEBUG:prometheus_api_client.prometheus_connect: Prometheus Query: container_cpu:container_cpu_usage_seconds_total:rate:sum 2020-02-27 17:55:53,753:DEBUG:urllib3.connectionpool: Starting new HTTP connection (1): ace5a9030446111eaa51406535e498bf-a39d9303f84e17f8.elb.eu-west-1.amazonaws.com:80 2020-02-27 17:55:53,779:DEBUG:urllib3.connectionpool: http://xxxxxxx.amazonaws.com:80 "GET /api/v1/query?query=container_cpu%3Acontainer_cpu_usage_seconds_total%3Arate%3Asum%5B120s%5D&time=1582826154 HTTP/1.1" 200 197 2020-02-27 17:55:53,828:INFO:model: training data range: 2020-02-25 17:56:32.441999912 - 2020-02-27 17:55:32.441999912 2020-02-27 17:55:53,828:DEBUG:model: begin training Initial log joint probability = -159.531 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 99 2672.44 0.00662214 113.375 1 1 125 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 172 2680.56 0.00135966 109.968 1.218e-05 0.001 242 LS failed, Hessian reset 199 2680.92 0.00272163 111.415 1 1 272 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 208 2681.75 0.00213477 402.471 1.513e-05 0.001 359 LS failed, Hessian reset 278 2683.21 0.000620649 158.737 4.925e-06 0.001 483 LS failed, Hessian reset 299 2683.33 2.07698e-05 97.2539 0.5944 0.5944 509 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 329 2683.34 5.97714e-08 98.7893 0.6591 0.6591 549 Optimization terminated normally: Convergence detected: relative gradient magnitude is below tolerance 2020-02-27 17:55:56,653:DEBUG:model: yhat yhat_lower yhat_upper timestamp 2020-02-27 17:56:32.441999912 11.045056 -4.791490 27.309229 2020-02-27 17:57:32.441999912 11.037078 -4.108421 26.203213 "
Can you please visit the metrics page in your browser?
i.e. http://prom_anomaly_detector_address:8080/metrics
and check if you can see the predicted metrics.
If you can see the predicted metrics there, can you check if your prometheus instance is configured to scrape the anomaly detector properly?
You might need to add the anomaly detector's address in the Prometheus config; some helpful docs are available here.
If none of this works, could you please show me your prometheus configuration?
Thanks @4n4nd for helping me with this issue.
1) http://172.xx.x.xx:8080/metrics --> I don't see any predicted metrics on the page.
2) Can you please help me with how to add the anomaly detector's address to the Prometheus config? Sorry, I wasn't able to figure it out from the attached document https://prometheus.io/docs/prometheus/latest/getting_started/#configuring-prometheus-to-monitor-the-sample-targets.
3) We are running Prometheus on an EKS Kubernetes cluster. Below is the Prometheus configuration from the Prometheus dashboard.
global:
  scrape_interval: 1m
  scrape_timeout: 10s
  evaluation_interval: 1m
alerting:
  alertmanagers:
We recently added new annotations to the Service object (here). Can you please try using the updated template for the anomaly detector Service?
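For reference, a minimal sketch of what such Service annotations might look like, assuming the common prometheus.io annotation convention; the actual keys and values should be taken from the updated template in the repo, and annotation-based scraping only works if your Prometheus kubernetes_sd relabel configs honor these annotations:

apiVersion: v1
kind: Service
metadata:
  name: prometheus-anomaly-detector      # hypothetical name
  annotations:
    prometheus.io/scrape: "true"         # let annotation-based scrape configs pick this Service up
    prometheus.io/port: "8080"           # port where the anomaly detector exposes /metrics
    prometheus.io/path: "/metrics"
spec:
  ports:
    - name: http
      port: 8080
      targetPort: 8080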
This might not work if Prometheus has been deployed in a different namespace than the prometheus anomaly detector. If that's the case you might need to add another job in your Prometheus config, something like:
- job_name: prometheus-anomaly-detector
honor_labels: true
scrape_interval: 1m
scrape_timeout: 10s
metrics_path: /metrics
scheme: http
static_configs:
- targets: ['route_to_anomaly_detector']
labels:
group: 'anomaly-detection'
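Here route_to_anomaly_detector is a placeholder. If the anomaly detector runs inside the same cluster, one option (a sketch; the Service name and namespace below are hypothetical) is to point the target at the Service's in-cluster DNS name:

  static_configs:
    - targets: ['prometheus-anomaly-detector.monitoring.svc:8080']   # <service>.<namespace>.svc:<port>
      labels:
        group: 'anomaly-detection'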
1) http://172.xx.x.xx:8080/metrics --> I don't see any predicted metrics on the page.
Can you check if you have the prometheus-anomaly-detector service exposed?
If not, you can do it with something like kubectl expose deployment anomaly_detector_deployment_name --port=8080 -n namespace_where_deployed
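Alternatively, a minimal Service manifest along these lines should work (the name and labels are placeholders; the selector must match your anomaly detector pod's labels):

apiVersion: v1
kind: Service
metadata:
  name: prometheus-anomaly-detector       # hypothetical name
  namespace: namespace_where_deployed
spec:
  type: ClusterIP
  selector:
    app: prometheus-anomaly-detector      # must match the pod labels from your deployment
  ports:
    - name: http
      port: 8080
      targetPort: 8080                    # port where app.py serves /metrics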
@4n4nd, sorry to ask, but what host value should we put for - targets: ['route_to_anomaly_detector'] in the Prometheus configuration on Kubernetes?
@nandhyala Is the Prometheus hosted in the same namespace as the anomaly detector?
@4n4nd,
Let me explain what I am doing here so that we are both on the same page.
1) My Prometheus/Grafana are running on a Kubernetes cluster and collecting application metrics.
2) For the Prometheus application I created an ALB endpoint on the cluster.
3) On an AWS EC2 machine I installed the RPMs required by Prophet.
4) I downloaded the repo https://github.com/AICoE/prometheus-anomaly-detector.git on the EC2 machine.
5) I exported the env variables below and executed the python app.py script on the EC2 machine.
6) The issue is that I don't understand how I can see the prediction graphs for the python app.py results on the Prometheus dashboard.
export FLT_PROM_URL=http://axxxxxxxx.elb.eu-west-1.amazonaws.com
export FLT_RETRAINING_INTERVAL_MINUTES=5
export FLT_ROLLING_TRAINING_WINDOW_SIZE=1d
export FLT_METRICS_LIST="container_cpu:container_cpu_usage_seconds_total:rate:sum"
Can you please review the above steps and correct me if I am doing something wrong here.
This application starts a web server and exposes metrics on port 8080. After step 5, I think you need to create an ALB endpoint for the anomaly detection application, just like for Prometheus. Then you will get a URL similar to http://axxxxxxxx.elb.eu-west-1.amazonaws.com which you can use in the Prometheus config. Sorry, I don't have much experience with AWS EC2. If you can, use the deployment configurations in the openshift dir to deploy the anomaly detector in your Kubernetes cluster.
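A rough sketch of such an in-cluster deployment, reusing the same FLT_ environment variables you exported on EC2; the image name and in-cluster Prometheus address are assumptions, so adjust them to match the repo's templates and your setup:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-anomaly-detector            # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-anomaly-detector
  template:
    metadata:
      labels:
        app: prometheus-anomaly-detector
    spec:
      containers:
        - name: prometheus-anomaly-detector
          image: quay.io/aicoe/prometheus-anomaly-detector:latest   # assumed image; use the one referenced in the repo's templates
          ports:
            - containerPort: 8080                                   # app.py exposes /metrics here
          env:
            - name: FLT_PROM_URL
              value: "http://prometheus.monitoring.svc:9090"        # hypothetical in-cluster Prometheus address
            - name: FLT_RETRAINING_INTERVAL_MINUTES
              value: "5"
            - name: FLT_ROLLING_TRAINING_WINDOW_SIZE
              value: "1d"
            - name: FLT_METRICS_LIST
              value: "container_cpu:container_cpu_usage_seconds_total:rate:sum"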
@4n4nd, that means I need to create a load balancer for the EC2 machine where the app.py script is running?
Yes, because prometheus needs some kind of address that it can use to reach the anomaly detector.
Got it @4n4nd, I will create an ALB for the EC2 machine.
You can test if the app is working by curling it from inside the EC2 instance.
curl localhost:8080/metrics
from inside the EC2 instance
[root@ip-xxxx~]# curl localhost:8080
container_cpu:container_cpu_usage_seconds_total:rate:sum_prophet{value_type="yhat"} -11.264086645780477
container_cpu:container_cpu_usage_seconds_total:rate:sum_prophet{value_type="yhat_lower"} -19.96119988783143
container_cpu:container_cpu_usage_seconds_total:rate:sum_prophet{value_type="yhat_upper"} -2.6234446516765177
container_cpu:container_cpu_usage_seconds_total:rate:sum_prophet{value_type="anomaly"} 1.0
[root@ip-xxxx ~]#
Exactly, these are your predicted values. But Prometheus needs to be able to access this page from outside the instance.
One more question on the prediction values: I see only yhat, yhat_lower, and yhat_upper, but I don't see a predicted metric value... In the repo's example graph I see a predicted metric value line.
yhat is the predicted value.
yhat_upper is the upper bound.
yhat_lower is the lower bound.
Thank you very much @4n4nd for helping me with these steps. I will create the ALB, place the new ALB address in the Prometheus config file, and test. If I still face any issues I will update here.
Once you create the endpoint, check with curl whether it is accessible.
Sure @4n4nd
I think I don't need an ALB on the PoC server, because I am able to get the anomaly detector metrics via http://host-ip:8080 from my Prometheus pods. I will just place http://host-ip:8080 in the Prometheus config and test.
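For reference, a minimal sketch of what that scrape entry might look like (host-ip is a placeholder for the EC2 host's address):

- job_name: prometheus-anomaly-detector
  scrape_interval: 1m
  metrics_path: /metrics
  static_configs:
    - targets: ['host-ip:8080']          # EC2 host running app.py
      labels:
        group: 'anomaly-detection'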
good idea :+1:
@4n4nd, now I am able to see the Prophet yhat, yhat_lower, and yhat_upper metrics on the Prometheus dashboard. But I don't have enough data to do training yet... Once I have data from the cluster application by next week, I will run it and check the predictions.
Thank you once again for your help here... Maybe I will reach out to you if I need any help with predictions on my data :)
No problem 👍 Do you want to close this issue then?
yes @4n4nd we can close this issue
@4n4nd I've got the anomaly detector running on K8s in the same namespace as the Prometheus instance and am able to get a successful response calling wget localhost:8000/metrics from within the anomaly detector, but the Prometheus instance cannot read from it, getting connection refused.
# anomaly detector instance
--2020-12-21 03:57:27-- http://localhost:8080/metrics
Resolving localhost (localhost)... 127.0.0.1, ::1
Connecting to localhost (localhost)|127.0.0.1|:8080... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1895 (1.9K) [text]
metrics: Is a directory
The anomaly detector Service is:
spec:
ports:
- name: http
protocol: TCP
port: 8080
targetPort: 8080
clusterIP: ********
type: ClusterIP
Wondering if there might be anything else I'm missing. Thank you.
@Carmezim I think this might be an issue with the way PAD is deployed in your k8s env. Can you post your deployment manifest? I think for some reason port 8080 for your pod is not being forwarded.
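For comparison, a sketch of the container port declaration in the Deployment's pod spec that should correspond to the Service's targetPort: 8080 (the container name is a placeholder; if the container isn't actually listening on 8080, Prometheus will see connection refused):

spec:
  template:
    spec:
      containers:
        - name: prometheus-anomaly-detector
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP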
I have followed the document below to set up Prophet on an EC2 machine, and I am pulling metrics (FLT_METRICS_LIST) from the Prometheus endpoint (FLT_PROM_URL), but somehow I don't see any results in the prediction metrics reports with the Prophet prediction.
I see the continuous output below on the Prophet machine. Can someone please help with this?
"Convergence detected: relative gradient magnitude is below tolerance 2020-02-27 05:05:29,945:INFO:main: Total Training time taken = 0:00:13.435543, for metric: container_cpu:container_cpu_usage_seconds_total:rate:sum {} 2020-02-27 05:07:30,086:INFO:schedule: Running job Every 2 minutes do train_model(initial_run=False, data_queue=<multiprocessing.queues.Queue object at 0x7fed042683d0>) (last run: 2020-02-27 05:05:29, next run: 2020-02-27 05:07:29) 2020-02-27 05:07:30,158:INFO:model: training data range: 2020-02-16 18:36:32.441999912 - 2020-02-27 05:06:32.441999912 Initial log joint probability = -624.556 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 99 14102.9 0.0341506 242.469 1 1 116 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 199 14111.8 0.00151911 266.794 1 1 244 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 299 14116.3 0.00037893 126.022 0.6299 0.6299 358 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 399 14118.4 0.000192582 86.9727 0.3648 1 479 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 499 14122.9 0.00185006 157.717 1 1 593 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 587 14124.5 0.000250693 238.923 2.704e-06 0.001 742 LS failed, Hessian reset 599 14124.6 6.07051e-05 83.8182 0.7625 0.7625 755 Iter log prob ||dx|| ||grad|| alpha alpha0 # evals Notes 645 14124.6 5.48574e-07 86.1275 0.3767 1 812 Optimization terminated normally: Convergence detected: relative gradient magnitude is below tolerance 2020-02-27 05:07:39,472:INFO:main: Total Training time taken = 0:00:09.362668, for metric: container_cpu:container_cpu_usage_seconds_total:rate:sum {}"