GoogleCloudPlatform / k8s-stackdriver

Apache License 2.0

log spamming with horizontal pod autoscaler and custom-metrics-stackdriver-adapter #318

Open HugoTigre opened 4 years ago

HugoTigre commented 4 years ago

I'm currently using Horizontal Pod Autoscaler (in google cloud) implemented with custom metrics, so custom-metrics-stackdriver-adapter is installed from here

The problem is that it's generating more than 10 log messages a second with the following errors:

jsonPayload: {
  message: "apiserver was unable to write a JSON response: http2: stream closed"   
  pid: "1"   
  source: "writers.go:172"   
 }

and

jsonPayload: {
  message: "apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}"   
  pid: "1"   
  source: "status.go:71"   
 }

The HPA is working as expected, so the number of errors is very strange. I couldn't find a reason for it, nor could I find documentation on how to change this, or even how to change the request frequency, either in the HPA or in this adapter.

HPA is configured as follows:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: xxx
  namespace: xxx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: xxx
  minReplicas: 2
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 40

Kubernetes version is: 1.15

Is there any reason for this? It looks like a bug.

Also this issue seems to be related: https://github.com/GoogleCloudPlatform/k8s-stackdriver/issues/303

JBodkin-LH commented 4 years ago

We've been seeing the same issue when using this adapter and autoscaling based on pubsub undelivered messages

msgongora commented 3 years ago

We're facing this issue as well, same use case.

lechen26 commented 3 years ago

same here.

we have HPA on most of our services based on custom metric (external). GKE version v1.17.15-gke.800 and gcr.io/google-containers/custom-metrics-stackdriver-adapter:v0.8.0

It is working, but we have a lot of errors in the GKE events of this kind: unable to fetch metrics from external metrics API: the server is currently unable to handle the request

The custom metrics adapter log is not very useful, as it's just full of the following:

apiserver was unable to write a JSON response: http2: stream closed
apiserver received an error that is not an metav1.Status: http2: stream closed

I've noticed that once the custom-metrics-stackdriver adapter was evicted and restarted, we got the "unable to handle the request" error. But even while it is just running, we get the errors every few hours or minutes. The HPA works, but I suspect it's not working as efficiently as it used to.

BTW, same happens on another cluster GKE version v1.17.14-gke.1600 and gcr.io/google-containers/custom-metrics-stackdriver-adapter:v0.10.2

any idea what's going on? thanks

lechen26 commented 3 years ago

anything here?

trucolo commented 3 years ago

same issue here

We are trying to use HPA with the same metric as @JBodkin-LH and I'm getting a lot of those errors, it seems the metrics are working fine, but that amount of error logs might hide other issues...

rajithavk commented 3 years ago

Is there a fix for this or a way to silence these logs? we've already run into a surge in costs due to this spamming issue.

masterlog80 commented 3 years ago

For Google Cloud, it's possible to set Logs Exclusion for a specific pattern: https://cloud.google.com/logging/docs/exclusions
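As a rough sketch of what such an exclusion pattern could look like, the Python snippet below assembles a Logging query-language filter for the two messages quoted in this issue. The `jsonPayload.message` field comes from the log entries above; the container name and the helper function are assumptions for illustration, so check the adapter Deployment in your own cluster before using the result.

```python
def build_exclusion_filter(container_name, message_fragments):
    """Return a Cloud Logging filter matching the noisy adapter entries.

    Hypothetical helper: it only builds the filter string, it does not
    call any Google Cloud API.
    """
    # ":" is the substring ("has") operator in the Logging query language.
    # Double quotes inside a fragment are backslash-escaped.
    message_clause = " OR ".join(
        'jsonPayload.message:"{}"'.format(fragment.replace('"', '\\"'))
        for fragment in message_fragments
    )
    return (
        'resource.type="k8s_container" AND '
        f'resource.labels.container_name="{container_name}" AND '
        f"({message_clause})"
    )


# Assumed container name; verify it against your adapter Deployment.
ADAPTER_CONTAINER = "pod-custom-metrics-stackdriver-adapter"

# Both messages reported in this issue contain this fragment.
print(build_exclusion_filter(ADAPTER_CONTAINER, ["http2: stream closed"]))
```

The printed string can be pasted into the exclusion filter of a log sink. Note that excluded entries are dropped before storage, so this only hides the symptom and reduces cost; it does not address whatever closes the HTTP/2 streams.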

bboykk1234 commented 3 years ago

same issue here

We set up the HPA following this guide: https://cloud.google.com/kubernetes-engine/docs/tutorials/autoscaling-metrics

any idea what's going on?

shpml commented 2 years ago

Any update?

Gwojda commented 2 years ago

Update ?

eric3chang commented 2 years ago

I'm also running into this issue :-(

asychev commented 2 years ago

Same for us. Any reaction from maintainers?

brianpham commented 2 years ago

I am seeing the same issue as well following the guide found here https://github.com/GoogleCloudPlatform/k8s-stackdriver/tree/master/custom-metrics-stackdriver-adapter. Anyone figure out a way to fix the error messages above or is this something we can ignore?

Running v1.21.5-gke.1302 for control plane and nodes with workload identity enabled.

stenicke commented 2 years ago

Same for me. Any update?

naxo8628 commented 2 years ago

+1

muscovitebob commented 1 year ago

Getting surprise spam cloud logging bills from this issue, except this is autodeployed as part of Cloud Composer.

kwiesmueller commented 1 year ago

@muscovitebob please reach out to cloud support for any issues caused by a managed product and related billing issues.

In general, when managing this component yourself, check your adapter's memory utilization. If it is running close to the memory limit, these errors can be a symptom. Also check the resources provided to the adapter in general and see if increasing them reduces the frequency of these errors (feel free to share learnings here).

If you are not seeing any data reaching the apiserver from the component, checking your networking rules/firewalls can also help to find what is causing traffic to get lost. Often these errors just mean the adapter cannot respond in time or at all.
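The memory check suggested above can be sketched in a few lines of Python. This is only an illustration: the usage and limit values are made up, the 90% threshold is an arbitrary assumption, and the helper names are hypothetical. In practice the usage figure would come from something like `kubectl top pod` and the limit from the adapter's container spec.

```python
# Binary and decimal suffixes used by Kubernetes memory quantities.
# Exponent notation (e.g. "1e9") and the milli suffix are not handled here.
_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def parse_memory(quantity):
    """Convert a quantity such as '200Mi' or '1G' to a number of bytes."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: plain byte count


def memory_utilization(usage, limit):
    """Return usage as a fraction of the limit (1.0 means at the limit)."""
    return parse_memory(usage) / parse_memory(limit)


def near_limit(usage, limit, threshold=0.9):
    """True when usage is at or above the (assumed) warning threshold."""
    return memory_utilization(usage, limit) >= threshold


# Hypothetical numbers, not measurements from this issue.
usage, limit = "185Mi", "200Mi"
print(f"utilization: {memory_utilization(usage, limit):.1%}")
print("close to limit" if near_limit(usage, limit) else "headroom available")
```

If the adapter regularly sits near its limit, raising the limit and watching whether the error rate drops is a cheap experiment.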

alina-bylkova commented 1 year ago

Same issue with stackdriver version gcr.io/gke-release/custom-metrics-stackdriver-adapter:v0.13.1-gke.0 and k8s version 1.23.14-gke.401

don-toptal commented 1 year ago

+1

Ture2019 commented 1 year ago

Hi, we experience the same issue in two different environments. It produces ~10,000 error messages per hour, which drowns out any useful error messages and causes higher than necessary costs. Quite an important issue, so to say. It is quite disappointing to see that this has not been solved in 2 1/2 years and is not prioritised more! The workaround for our application is to go back to Composer version 1. We are happy to provide more information if anybody is willing to take on this issue. "old prod env":

Steps to reproduce the issue:

  1. Grant access:
    gcloud projects add-iam-policy-binding kolumbus-atl-prod \
    --member=serviceAccount:service-123456789@cloudcomposer-accounts.iam.gserviceaccount.com \
    --role=roles/composer.ServiceAgentV2Ext
  2. Create environment:
    gcloud composer environments create kolumbus-composer5 \
    --location=europe-west1 \
    --image-version=composer-2.2.0-airflow-2.5.1 \
    --environment-size=small \
    --maintenance-window-start='2023-05-25T17:30:00Z' \
    --maintenance-window-end='2023-05-25T21:30:00Z' \
    --maintenance-window-recurrence='FREQ=DAILY'
  3. Look at Logs Explorer, filter by Error.

davidxia commented 1 year ago

I don't think GCP teams look at, are notified of, or maybe just don't care about GitHub.com comments and issues. The most effective way to get them to fix things is to create a partner issue on their internal tracking system, or a GCP support case if you're a paying GCP customer (paying more money correlates with faster response time), and link back to this issue.

rajithavk commented 1 year ago

Well, I have bad news: they don't care even if you pay them 🤧 GCP seems to be losing to other key players.

Ture2019 commented 1 year ago

I added a note to the corresponding Composer bug report: https://issuetracker.google.com/issues/159171905 Please upvote and comment there, too.