configconnector_applied_resources_total not always available

ShaunMaxwell commented 1 year ago

Checklist

[X] I did not find a related open issue.
[X] I did not find a solution in the troubleshooting guide: (https://cloud.google.com/config-connector/docs/troubleshooting)
[X] If this issue is time-sensitive, I have submitted a corresponding issue with GCP support.

Bug Description

The configconnector_applied_resources_total metric is not always available. It appears for ~8 minutes and then disappears for ~10 minutes.

I used kubectl port-forward to expose the pod on localhost and retrieved the metrics myself with curl. This confirmed that the whole series of configconnector_applied_resources_total metrics disappears for a few minutes at a time.
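
For anyone who wants to repeat this check without watching curl output by hand, here is a minimal sketch in Python. It assumes the pod's metrics port has already been forwarded to localhost; the port 8888 and the /metrics path in the usage note are assumptions, and the helper names (metric_present, fetch, watch) are hypothetical, not part of Config Connector.

```python
import time
import urllib.request

METRIC = "configconnector_applied_resources_total"


def metric_present(exposition_text, metric=METRIC):
    """Return True if at least one sample line for `metric` is present.

    Comment lines (# HELP / # TYPE) do not count as samples.
    """
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # A sample line is "name{labels} value" or "name value".
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name == metric:
            return True
    return False


def fetch(url, timeout=10.0):
    """Fetch the raw Prometheus text exposition from `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def watch(url, interval=30.0, checks=40):
    """Poll `url` and print whether the metric is present at each check."""
    for _ in range(checks):
        stamp = time.strftime("%H:%M:%S")
        try:
            state = "present" if metric_present(fetch(url)) else "MISSING"
            print(stamp, state)
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            print(stamp, "scrape failed:", exc)
        time.sleep(interval)
```

With the port-forward running, calling watch("http://localhost:8888/metrics") prints one line per check, which makes the windows where the series is missing easy to see.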

I suspect the series disappears when the metric is reset, and that it stays unavailable for so long because client-side throttling of the Kubernetes API requests makes a complete run of the metrics endpoint take a long time.

In the logs of the cnrm-resource-stats-recorder pod, there are a lot of throttling messages and errors related to missing resources for some CRDs.

Additional Diagnostic Information

None

Kubernetes Cluster Version

v1.25.10-gke.2700

Config Connector Version

1.108.0

Config Connector Mode

cluster mode

Log Output

I1003 11:14:55.226212       1 request.go:601] Waited for 5.196118125s due to client-side throttling, not priority and fairness, request: GET:https://10.35.24.1:443/apis/sourcerepo.cnrm.cloud.google.com/v1beta1?timeout=32s
E1003 11:14:57.131915       1 main.go:149] setup "msg"="error recording metrics for CRD %v: %v" "error"="error listing objects for dialogflowcx.cnrm.cloud.google.com/v1alpha1, Kind=DialogflowCXPage: error listing objects:no matches for kind \"DialogflowCXPage\" in version \"dialogflowcx.cnrm.cloud.google.com/v1alpha1\"" "dialogflowcx.cnrm.cloud.google.com/v1alpha1, Kind=DialogflowCXPage"="(MISSING)"
E1003 11:15:04.332728       1 main.go:149] setup "msg"="error recording metrics for CRD %v: %v" "error"="error listing objects for firebase.cnrm.cloud.google.com/v1alpha1, Kind=FirebaseWebApp: error listing objects:no matches for kind \"FirebaseWebApp\" in version \"firebase.cnrm.cloud.google.com/v1alpha1\"" "firebase.cnrm.cloud.google.com/v1alpha1, Kind=FirebaseWebApp"="(MISSING)"
I1003 11:15:05.476080       1 request.go:601] Waited for 1.04591486s due to client-side throttling, not priority and fairness, request: GET:https://10.35.24.1:443/apis/mesh.cloud.google.com/v1alpha1?timeout=32s
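
To get a rough idea of how much time goes to throttling and which kinds are reported as missing, log lines like the ones above can be summarized with a short script. This is a sketch that assumes the message formats shown in this excerpt. The regular expressions and the summarize helper are hypothetical, and only waits printed in seconds or milliseconds are counted.

```python
import re
from collections import Counter

# Matches e.g. "Waited for 5.196118125s due to client-side throttling".
THROTTLE_RE = re.compile(r"Waited for ([0-9.]+)(ms|s) due to client-side throttling")

# Matches e.g. no matches for kind \"FirebaseWebApp\" in version \"...\"
# (the backslashes before the quotes are optional).
MISSING_RE = re.compile(
    r'no matches for kind \\?"([^"\\]+)\\?" in version \\?"([^"\\]+)\\?"'
)


def summarize(lines):
    """Return (throttled_count, total_wait_seconds, Counter of (version, kind))."""
    throttled = 0
    total_wait = 0.0
    missing = Counter()
    for line in lines:
        m = THROTTLE_RE.search(line)
        if m:
            value = float(m.group(1))
            total_wait += value / 1000.0 if m.group(2) == "ms" else value
            throttled += 1
        m = MISSING_RE.search(line)
        if m:
            kind, version = m.groups()
            missing[(version, kind)] += 1
    return throttled, total_wait, missing
```

Passing the lines of a saved log file to summarize gives the number of throttled requests, the total time spent waiting, and a count per missing (version, kind) pair.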

Steps to reproduce the issue

Install Config Connector
Set up Prometheus to scrape the annotated services
Create at least one Config Connector resource
Wait for Prometheus to gather some data (~15 minutes)
Graph the configconnector_applied_resources_total metric
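
As an alternative to reading the gaps off the graph, the sample timestamps for the metric can be checked programmatically. The sketch below assumes the timestamps (in seconds) have already been exported from Prometheus. The find_gaps helper, the 60-second scrape interval, and the tolerance factor are all hypothetical choices for illustration.

```python
def find_gaps(sample_times, scrape_interval=60.0, tolerance=2.0):
    """Return (start, end) pairs where consecutive samples are further
    apart than tolerance * scrape_interval seconds."""
    times = sorted(sample_times)
    gaps = []
    for prev, cur in zip(times, times[1:]):
        if cur - prev > tolerance * scrape_interval:
            gaps.append((prev, cur))
    return gaps
```

For example, find_gaps([0, 60, 120, 720, 780]) returns [(120, 720)], a gap of 600 seconds between two samples.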

YAML snippets

No response

diviner524 commented 1 year ago

@justinsb I remember you had looked into similar issues before and attempted a refactoring to make the cnrm-resource-stats-recorder controller more performant. This looks to be related?

petarpavaolacic commented 1 year ago

I have encountered the same issue with version 1.110. It appears that cnrm-resource-stats-recorder is attempting to access v1alpha1 CRDs that need to be manually installed (https://cloud.google.com/config-connector/docs/how-to/install-alpha-crds).

nweisenauer-sap commented 8 months ago

I have encountered the same issue with version 1.110. It appears that cnrm-resource-stats-recorder is attempting to access v1alpha1 CRDs that need to be manually installed (https://cloud.google.com/config-connector/docs/how-to/install-alpha-crds).

Is there any way to prevent the cnrm-resource-stats-recorder from looking for the v1alpha1 CRDs? We do not want to maintain them manually. Right now our cnrm-resource-stats-recorder is flooding our log system with thousands of error logs when looking for these CRDs every minute.

GoogleCloudPlatform / k8s-config-connector