GoogleCloudPlatform / k8s-config-connector

GCP Config Connector, a Kubernetes add-on for managing GCP resources
https://cloud.google.com/config-connector/docs/overview
Apache License 2.0

configconnector_applied_resources_total not always available #882

Open ShaunMaxwell opened 1 year ago

ShaunMaxwell commented 1 year ago

Bug Description

The configconnector_applied_resources_total metric is not always available. It appears for ~8 minutes and then disappears for ~10 minutes.

I used kubectl port-forward to expose the pod on localhost and retrieved the metrics myself with curl. This confirmed that the whole series of configconnector_applied_resources_total metrics disappears for a few minutes at a time.
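
A minimal sketch of that manual check, for anyone who wants to repeat it in a loop instead of by hand. It only parses the Prometheus text exposition format and counts the sample lines for one metric. The function name `series_for` is hypothetical, and the port and path in the usage note are assumptions; substitute whatever your port-forward exposes.

```python
import urllib.request

METRIC = "configconnector_applied_resources_total"


def series_for(metric: str, exposition: str) -> list[str]:
    """Return the sample lines for `metric` from Prometheus text-format output."""
    samples = []
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name == metric:
            samples.append(line)
    return samples


def fetch(url: str) -> str:
    """Fetch the raw metrics page from a (port-forwarded) endpoint."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")
```

With a port-forward running, `len(series_for(METRIC, fetch("http://localhost:8888/metrics")))` gives the number of series currently exposed (the URL is an assumed example). Calling it every few seconds shows when the count drops to zero.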

I suspect the metric disappears because it is reset, and that it stays unavailable for so long because client-side throttling of the Kubernetes API requests makes a complete run of the metrics collection take a long time.
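
To illustrate why throttling alone could stretch a run to minutes, here is a rough back-of-the-envelope sketch, not a description of how the recorder is actually configured. It assumes the requests go through a simple token-bucket limiter and are issued back to back. The function name is hypothetical, and the QPS, burst, and request-count values in the usage note are placeholders; none of them is stated in this issue.

```python
def estimated_throttle_wait(requests: int, qps: float, burst: int) -> float:
    """Approximate seconds spent waiting on a token-bucket rate limiter.

    The first `burst` requests pass immediately; each later request has to
    wait for a token that is refilled at `qps` tokens per second. Refill
    during the initial burst is ignored, so this is an approximation.
    """
    if requests <= burst:
        return 0.0
    return (requests - burst) / qps
```

For example, `estimated_throttle_wait(3000, 5.0, 10)` gives 598.0 seconds, roughly ten minutes, for a hypothetical run of 3000 API calls at 5 requests per second with a burst of 10.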

In the logs of the cnrm-resource-stats-recorder pod, there are a lot of throttling messages and errors related to missing resources for some CRDs.

Additional Diagnostic Information

None

Kubernetes Cluster Version

v1.25.10-gke.2700

Config Connector Version

1.108.0

Config Connector Mode

cluster mode

Log Output

I1003 11:14:55.226212       1 request.go:601] Waited for 5.196118125s due to client-side throttling, not priority and fairness, request: GET:https://10.35.24.1:443/apis/sourcerepo.cnrm.cloud.google.com/v1beta1?timeout=32s
E1003 11:14:57.131915       1 main.go:149] setup "msg"="error recording metrics for CRD %v: %v" "error"="error listing objects for dialogflowcx.cnrm.cloud.google.com/v1alpha1, Kind=DialogflowCXPage: error listing objects:no matches for kind \"DialogflowCXPage\" in version \"dialogflowcx.cnrm.cloud.google.com/v1alpha1\"" "dialogflowcx.cnrm.cloud.google.com/v1alpha1, Kind=DialogflowCXPage"="(MISSING)"
E1003 11:15:04.332728       1 main.go:149] setup "msg"="error recording metrics for CRD %v: %v" "error"="error listing objects for firebase.cnrm.cloud.google.com/v1alpha1, Kind=FirebaseWebApp: error listing objects:no matches for kind \"FirebaseWebApp\" in version \"firebase.cnrm.cloud.google.com/v1alpha1\"" "firebase.cnrm.cloud.google.com/v1alpha1, Kind=FirebaseWebApp"="(MISSING)"
I1003 11:15:05.476080       1 request.go:601] Waited for 1.04591486s due to client-side throttling, not priority and fairness, request: GET:https://10.35.24.1:443/apis/mesh.cloud.google.com/v1alpha1?timeout=32s
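
A minimal sketch for summarizing log output like the above: it adds up the reported throttling waits and counts the "no matches for kind" errors per kind. The regular expressions are written against the line shapes shown here and may need adjusting for other versions. The function name `summarize` is hypothetical, and the handling of durations in `ms` is an assumption about how shorter waits would be printed.

```python
import re
from collections import Counter

# Matches e.g. "Waited for 5.196118125s due to client-side throttling".
WAIT_RE = re.compile(r"Waited for ([0-9.]+)(ms|s) due to client-side throttling")
# Matches e.g. no matches for kind \"DialogflowCXPage\" (backslashes optional).
KIND_RE = re.compile(r'no matches for kind \\?"([^"\\]+)\\?"')


def summarize(lines):
    """Return (total throttling wait in seconds, Counter of missing kinds)."""
    total_wait = 0.0
    missing = Counter()
    for line in lines:
        wait = WAIT_RE.search(line)
        if wait:
            value, unit = float(wait.group(1)), wait.group(2)
            total_wait += value / 1000 if unit == "ms" else value
        kind = KIND_RE.search(line)
        if kind:
            missing[kind.group(1)] += 1
    return total_wait, missing
```

Feeding it the recorder's log lines (for example, read from a saved log file) gives a quick view of how much of a collection run is spent waiting and which kinds are being looked up without an installed CRD.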

Steps to reproduce the issue

  1. Install Config Connector
  2. Set up Prometheus to scrape the annotated services
  3. Create at least one Config Connector resource
  4. Wait for Prometheus to gather some data (~15 minutes)
  5. Graph the configconnector_applied_resources_total metric

YAML snippets

No response

diviner524 commented 1 year ago

@justinsb I remember you had looked into similar issues before and attempted a refactoring to make the cnrm-resource-stats-recorder controller more performant. This looks to be related?

petarpavaolacic commented 1 year ago

I have encountered the same issue with version 1.110. It appears that cnrm-resource-stats-recorder is attempting to access v1alpha1 CRDs that need to be manually installed (https://cloud.google.com/config-connector/docs/how-to/install-alpha-crds).

nweisenauer-sap commented 8 months ago

I have encountered the same issue with version 1.110. It appears that cnrm-resource-stats-recorder is attempting to access v1alpha1 CRDs that need to be manually installed (https://cloud.google.com/config-connector/docs/how-to/install-alpha-crds).

Is there any way to prevent the cnrm-resource-stats-recorder from looking for the v1alpha1 CRDs? We do not want to maintain them manually. Right now our cnrm-resource-stats-recorder is flooding our log system with thousands of error logs when looking for these CRDs every minute.