AICoE / prometheus-anomaly-detector

A newer more updated version of the prometheus anomaly detector (https://github.com/AICoE/prometheus-anomaly-detector-legacy)
GNU General Public License v3.0
596 stars 150 forks

yhat & yhat_lower & yhat_upper prediction values is not accurate with current original values. #112

Closed devoprock closed 2 years ago

devoprock commented 4 years ago

@4n4nd,

I am planning to use the yhat value for Kubernetes pod autoscaling (at peak time it will scale up to 1000+ pods), feeding the yhat predictions for the original Prometheus metric (container_cpu:container_cpu_usage_seconds_total:rate) into the HPA.

I set the variables below on the Prophet application node and started app.py.

export FLT_PROM_URL=http://xxxxxxx.amazonaws.com
export FLT_RETRAINING_INTERVAL_MINUTES=1
export FLT_ROLLING_TRAINING_WINDOW_SIZE=3d
export FLT_METRICS_LIST="container_cpu:container_cpu_usage_seconds_total:rate:sum"

container_cpu:container_cpu_usage_seconds_total:rate:sum = sum (rate (container_cpu_usage_seconds_total{container_name="smart-savant-cpu"}[5m]))

I noticed the following issues with the steps above:

1) I am able to see the yhat, yhat_lower and yhat_upper prediction values in Prometheus, but somehow they do not match the original "container_cpu:container_cpu_usage_seconds_total:rate:sum" metric in Prometheus/Grafana.

I have attached a Grafana dashboard screenshot here.
Green bar: container_cpu:container_cpu_usage_seconds_total:rate:sum
Sky blue bar: yhat
Red bar: yhat_upper
Orange bar: yhat_lower
Bottom bar: anomaly detector

Can you please tell me what I am doing wrong above? Could you review it and let me know how I can fix it?

2) How can I see future forecast data on the Prometheus/Grafana dashboard?

3) How can I use the daily, weekly and holiday settings if required, and what does the default prediction in app.py do?

Can you please help me with the above so I can set up Kubernetes HPA autoscaling on our infrastructure?

Thanks 🙏

devoprock commented 4 years ago

Screen Shot 2020-03-01 at 10 06 30 AM

4n4nd commented 4 years ago

> 1) I am able to see the yhat, yhat_lower and yhat_upper prediction values in Prometheus, but somehow they do not match the original "container_cpu:container_cpu_usage_seconds_total:rate:sum" metric in Prometheus/Grafana.
> I have attached a Grafana dashboard screenshot here. Green bar: container_cpu:container_cpu_usage_seconds_total:rate:sum. Sky blue bar: yhat. Red bar: yhat_upper. Orange bar: yhat_lower. Bottom bar: anomaly detector.
> Can you please tell me what I am doing wrong above? Could you review it and let me know how I can fix it?

Hey I don't really get what isn't working. To me it looks like everything is working as intended.

> 2) How can I see future forecast data on the Prometheus/Grafana dashboard?

Currently we don't have any way to do this.

> 3) How can I use the daily, weekly and holiday settings if required, and what does the default prediction in app.py do?

You can configure the model in the model.py file.
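As a rough illustration of what such a configuration could look like, here is a minimal sketch using Prophet's public API. The settings in `PROPHET_KWARGS`, the helper names `build_model` and `forecast`, and the 15-minute horizon are assumptions made for this example; they are not taken from the project's model.py.

```python
# Hypothetical settings for illustration; not copied from the project's model.py.
PROPHET_KWARGS = {
    "daily_seasonality": True,    # model the intraday load pattern
    "weekly_seasonality": True,   # model weekday vs. weekend differences
    "yearly_seasonality": False,  # not useful with only days or weeks of history
    "interval_width": 0.95,       # width of the yhat_lower..yhat_upper band
}


def build_model(country=None):
    """Create an unfitted Prophet model, optionally with country holidays."""
    try:
        from prophet import Prophet    # current package name
    except ImportError:
        from fbprophet import Prophet  # package name used by older releases
    model = Prophet(**PROPHET_KWARGS)
    if country:
        # Built-in holiday calendar, e.g. country="US"
        model.add_country_holidays(country_name=country)
    return model


def forecast(model, frame, minutes_ahead=15):
    """Fit on a DataFrame with 'ds' and 'y' columns and predict ahead."""
    model.fit(frame)
    future = model.make_future_dataframe(periods=minutes_ahead, freq="1min")
    return model.predict(future)[["ds", "yhat", "yhat_lower", "yhat_upper"]]
```

Prophet is only imported when `build_model` is called, so the settings can be inspected without the library installed. A wider `interval_width` gives a wider band between yhat_lower and yhat_upper.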

devoprock commented 4 years ago

@4n4nd,

I don't see any issue with the setup. I can see the yhat, yhat_lower and yhat_upper prediction values in Prometheus, but somehow there is a large difference between the original values and yhat, yhat_lower and yhat_upper.

Screen Shot 2020-03-02 at 9 41 52 AM

See the attached screenshot above:
Green bar: original container_cpu:container_cpu_usage_seconds_total:rate:sum
Sky blue bar: yhat
Red bar: yhat_upper
Orange bar: yhat_lower
Bottom bar: anomaly detector
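One way to make "a large difference" concrete, rather than judging it from the graph, is to compute an error figure over aligned samples. The sketch below is only an illustration: the helper names are hypothetical and the sample values are made up, not taken from the screenshots.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between the observed values and yhat."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def band_coverage(actual, lower, upper):
    """Fraction of observed values inside [yhat_lower, yhat_upper]."""
    inside = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return inside / len(actual)


# Made-up sample values, aligned by timestamp, purely for illustration.
actual = [2.0, 2.5, 3.0, 2.8]
yhat = [2.1, 2.4, 2.6, 2.9]
yhat_lower = [1.5, 1.9, 2.0, 2.3]
yhat_upper = [2.6, 2.9, 2.9, 3.4]

mae = mean_absolute_error(actual, yhat)
coverage = band_coverage(actual, yhat_lower, yhat_upper)
print(f"MAE={mae:.3f} coverage={coverage:.2f}")
```

Tracking these two numbers across retraining runs would show whether more history actually improves the fit.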

4n4nd commented 4 years ago

Screen Shot 2020-03-01 at 10 06 30 AM

That seems to be doing well

Or you can try giving the model more data, export FLT_ROLLING_TRAINING_WINDOW_SIZE=30d

and I don't think you need to retrain your model every minute, export FLT_RETRAINING_INTERVAL_MINUTES=15

This will look at the past 30 days of data and retrain the model every 15 minutes.

Hope this is helpful. Otherwise you can try tweaking model.py, which uses Prophet (https://facebook.github.io/prophet/).
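To picture how a rolling training window behaves, here is a small sketch. `parse_window` and `trim_to_window` are hypothetical helpers written for this example and are not the detector's actual implementation; the point is that a 30d window cannot drop anything when only a few days of samples exist, so the training set simply keeps growing until 30 days are available.

```python
import re
from datetime import datetime, timedelta

_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_window(text):
    """Turn a duration string such as '30d' or '15m' into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def trim_to_window(samples, window, now):
    """Keep only (timestamp, value) samples that fall inside the rolling window."""
    cutoff = now - window
    return [(ts, value) for ts, value in samples if ts >= cutoff]


# Five daily samples, the oldest being 4 days old.
now = datetime(2020, 3, 5)
samples = [(now - timedelta(days=d), float(d)) for d in range(5)]

print(len(trim_to_window(samples, parse_window("30d"), now)))  # all samples kept
print(len(trim_to_window(samples, parse_window("3d"), now)))   # oldest sample dropped
```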

devoprock commented 4 years ago

Sure @4n4nd, I will try the values below. Just to let you know, we only have the last 4 days of data in Prometheus, and in the screenshot below I see a lot of difference between the original values and the yhat values.

export FLT_ROLLING_TRAINING_WINDOW_SIZE=30d
export FLT_RETRAINING_INTERVAL_MINUTES=15

Screen Shot 2020-03-02 at 9 41 52 AM

Green bar: original container_cpu:container_cpu_usage_seconds_total:rate:sum
Sky blue bar: yhat
Red bar: yhat_upper
Orange bar: yhat_lower
Bottom bar: anomaly detector

4n4nd commented 4 years ago

let the anomaly detector collect a few days of data and then see if the predictions improve

4n4nd commented 4 years ago

> Just to let you know, we only have the last 4 days

It should keep accumulating the data until it has 30 days of data.

hemajv commented 4 years ago

@nandhyala By default, the model being trained is the Prophet model. You can also try training the Fourier model by importing model_fourier.py in your app.py and comparing the two. As @4n4nd mentioned, the small amount of training data you have might be affecting the model's performance.

devoprock commented 4 years ago

@4n4nd @hemajv, I really appreciate your help! I have now been running the app.py script for the last 4 days and it is still in progress, but I don't see much improvement in the yhat, yhat_upper and yhat_lower values compared with the original values.

Can you please confirm whether we can't train or predict with 1 week of data using Prophet? And how can I import model_fourier.py in app.py?

Screen Shot 2020-03-04 at 9 00 34 PM

hemajv commented 4 years ago

@nandhyala I can see that few anomalies were detected. Since the anomaly is a 0 or 1 value, try increasing the scale for the 'anomaly' line (i.e. the orange line) on the graph so that you can visualize it better. You can predict on training data of 1 week, but the performance of the model may be better with more training data.
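To illustrate why a 0/1 anomaly series is hard to see next to a CPU metric, here is a minimal sketch. It assumes the flag is raised when the observed value leaves the yhat_lower..yhat_upper band; that rule, the helper name `anomaly_flag`, and the sample values are assumptions for this example. Multiplying the flag by the peak of the observed series is one way to put it on the same scale as the metric.

```python
def anomaly_flag(value, yhat_lower, yhat_upper):
    """Return 1 when the observed value is outside the predicted band, else 0."""
    return 0 if yhat_lower < value < yhat_upper else 1


# Made-up aligned samples for illustration.
observed = [2.0, 2.5, 6.0]
lower = [1.5, 1.9, 2.0]
upper = [2.6, 2.9, 2.9]

flags = [anomaly_flag(v, lo, hi) for v, lo, hi in zip(observed, lower, upper)]

# Stretch the 0/1 flag to the peak of the metric so it is visible on a shared axis.
peak = max(observed)
scaled = [flag * peak for flag in flags]
print(flags, scaled)
```

In Grafana the same effect can usually be had without changing the data, for example by placing the anomaly series on its own axis.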

For training the Fourier model, change https://github.com/AICoE/prometheus-anomaly-detector/blob/master/app.py#L13 to: import model_fourier as model

devoprock commented 4 years ago

Thanks @hemajv, I will check with 2 weeks of data and come back if I see the same issues.

In parallel, I will try model_fourier as per the link above and let you know.

devoprock commented 4 years ago

@hemajv,

I tried model_fourier with 2 days of data, but I don't see much improvement with the Fourier model. Below is the graph of the original and Fourier metrics in Grafana.

Screen Shot 2020-03-10 at 2 00 32 PM

hemajv commented 4 years ago

@nandhyala So this is one of the drawbacks of Fourier: it is more of a statistical extrapolation of the values, whereas the Prophet model takes into account seasonality and trend in your data. One model may perform better than the other depending on the nature of the time series metric. I would still recommend collecting at least 2 weeks of data and re-training the Prophet/Fourier models.
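To show what "statistical extrapolation" means in practice, here is a small standard-library sketch of Fourier extrapolation: take the discrete Fourier transform of the history, keep the strongest harmonics, and evaluate them past the end of the data. This is a simplified illustration and not the project's model_fourier.py; the function names and the `n_harmonics` parameter are assumptions for this example.

```python
import cmath
import math


def dft(values):
    """Plain O(n^2) discrete Fourier transform of a real-valued sequence."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def fourier_extrapolate(values, steps_ahead, n_harmonics=3):
    """Extend a series by replaying its strongest harmonics past the last sample."""
    n = len(values)
    mean = sum(values) / n
    spectrum = dft([v - mean for v in values])
    # Rank the positive frequencies by amplitude and keep the strongest ones.
    strongest = sorted(
        range(1, n // 2 + 1), key=lambda k: abs(spectrum[k]), reverse=True
    )[:n_harmonics]
    forecast = []
    for t in range(n, n + steps_ahead):
        value = mean
        for k in strongest:
            # Bins k and n-k are complex conjugates for real input, so they
            # combine into twice the real part (the Nyquist bin counts once).
            scale = 1 if 2 * k == n else 2
            value += scale * (spectrum[k] * cmath.exp(2j * math.pi * k * t / n)).real / n
        forecast.append(value)
    return forecast


# A clean sine with period 8 is continued exactly; real metrics rarely are.
history = [math.sin(2 * math.pi * t / 8) for t in range(32)]
predicted = fourier_extrapolate(history, steps_ahead=8, n_harmonics=1)
```

Because the forecast is just a replay of periodic components found in the history, it cannot anticipate patterns that the history does not contain, which is one reason a couple of days of data gives poor results.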

devoprock commented 4 years ago

Thanks @hemajv, I will check the same Prophet/Fourier models once I get 2 weeks of data.

sesheta commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

sesheta commented 2 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

sesheta commented 2 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

sesheta commented 2 years ago

@sesheta: Closing this issue.
