GoogleCloudPlatform / cloud-sql-proxy

A utility for connecting securely to your Cloud SQL instances
Apache License 2.0

Intermittent errors logged after enabling telemetry #2018

Open tomassommareqt opened 1 year ago

tomassommareqt commented 1 year ago

Bug Description

We are running gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.7.0 as a container next to our main http api container for connectivity to our CloudSQL instance.

After enabling telemetry using the --telemetry-project and --telemetry-prefix flags, we have repeatedly seen the following error logged:

2023/11/04 13:58:43 Failed to export to Stackdriver: rpc error: code = Internal desc = One or more TimeSeries could not be written: Internal error encountered. Please retry after a few seconds. If internal errors persist, contact support at https://cloud.google.com/support/docs.: global{} timeSeries[0]: custom.googleapis.com/opencensus/<redacted>_cloud_sql_proxy/cloudsqlconn/refresh_success_count{opencensus_task:go-1@<redacted>,cloudsql_instance:<redacted>}; Internal error encountered. Please retry after a few seconds. If internal errors persist, contact support at https://cloud.google.com/support/docs.: global{} timeSeries[1]: custom.googleapis.com/opencensus/<redacted>_cloud_sql_proxy/cloudsqlconn/dial_latency{cloudsql_instance:<redacted>,opencensus_task:go-1@<redacted>}

However, when inspecting the metrics we can see that they are exported as expected, so the main impact is polluted logs. It would still be interesting to understand why this error is reported.

Example code (or command)

// paste your code or command here

Stacktrace

`2023/11/04 13:58:43 Failed to export to Stackdriver: rpc error: code = Internal desc = One or more TimeSeries could not be written: Internal error encountered. Please retry after a few seconds. If internal errors persist, contact support at https://cloud.google.com/support/docs.: global{} timeSeries[0]: custom.googleapis.com/opencensus/<redacted>_cloud_sql_proxy/cloudsqlconn/refresh_success_count{opencensus_task:go-1@<redacted>,cloudsql_instance:<redacted>}; Internal error encountered. Please retry after a few seconds. If internal errors persist, contact support at https://cloud.google.com/support/docs.: global{} timeSeries[1]: custom.googleapis.com/opencensus/<redacted>_cloud_sql_proxy/cloudsqlconn/dial_latency{cloudsql_instance:<redacted>,opencensus_task:go-1@<redacted>}`

Steps to reproduce?

  1. Launch cloud-sql-proxy 2.7.0 as a container in GCP GKE.
  2. Inspect logs.

Environment

  1. OS type and version: GCP GKE 1.24.14-gke.2700
  2. Cloud SQL Proxy version: 2.7.0
  3. Proxy invocation command (Deployment manifest excerpt):

         apiVersion: apps/v1
         kind: Deployment
         metadata:
           name:
         spec:
           template:
             spec:
               containers:
                 - name: cloudsql-proxy
                   image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.7.0
                   args:
                     - "--auto-iam-authn"
                     - "--max-sigterm-delay"
                     - "25s"
                     - "--structured-logs"
                     - "--telemetry-project"
                     - ""
                     - "--telemetry-prefix"
                     - "_cloud_sql_proxy"
                     - ""

Additional Details

No response

enocom commented 1 year ago

Thanks @tomassommareqt. FWIW I have seen the same logs when working on this feature. I don't expect these logs to show up outside of a dev context, though. We'll investigate and fix this.

rojomisin commented 8 months ago

Still seen in 2.9.0, although the metrics are exported successfully using the metrics writer role:

2024/03/15 23:10:50 Failed to export to Stackdriver: rpc error: code = PermissionDenied desc = The caller does not have permission
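
Note that this line carries a different gRPC status code (PermissionDenied) than the original report (Internal), so the two may not share a cause. When triaging such logs, a small helper like the one below can pull the status code out of each line. This is only a sketch for log post-processing: `grpcCode` and `isLikelyTransient` are hypothetical names, not part of the Proxy, and the list of codes treated as transient is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// codeRe matches the status code in messages shaped like
// "rpc error: code = Internal desc = ...", as seen in the logs in this thread.
var codeRe = regexp.MustCompile(`rpc error: code = (\w+) desc = `)

// grpcCode returns the gRPC status code name embedded in a log line,
// or "" if the line does not contain one.
func grpcCode(line string) string {
	m := codeRe.FindStringSubmatch(line)
	if m == nil {
		return ""
	}
	return m[1]
}

// isLikelyTransient reports whether a code is usually worth retrying.
// Assumption: Internal is included because the error text itself says
// "Please retry after a few seconds"; the other two are a common convention.
func isLikelyTransient(code string) bool {
	switch code {
	case "Internal", "Unavailable", "DeadlineExceeded":
		return true
	}
	return false
}

func main() {
	// Shortened copies of the two log lines quoted in this issue.
	lines := []string{
		"2023/11/04 13:58:43 Failed to export to Stackdriver: rpc error: code = Internal desc = One or more TimeSeries could not be written",
		"2024/03/15 23:10:50 Failed to export to Stackdriver: rpc error: code = PermissionDenied desc = The caller does not have permission",
	}
	for _, l := range lines {
		code := grpcCode(l)
		fmt.Printf("code=%s transient=%t\n", code, isLikelyTransient(code))
	}
}
```

A PermissionDenied result points at IAM configuration rather than at an intermittent backend error, so it is probably worth tracking separately from the Internal errors above.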

enocom commented 8 months ago

Thanks, @rojomisin. We still haven't gotten to this. I wonder if this is a race condition in OpenCensus itself.

rojomisin commented 8 months ago

Perhaps this is fixed in the OpenTelemetry package? https://github.com/open-telemetry/opentelemetry-go-contrib

enocom commented 7 months ago

Quite possibly. We're currently using OpenCensus given that some internal tooling that uses the Proxy has a big investment in OpenCensus. But we might revisit that decision now that OpenTelemetry's metrics package is stable.

jackwotherspoon commented 3 weeks ago

We will be migrating to OpenTelemetry in the somewhat near future, which will hopefully resolve this issue...