Deadline exceeded in gRPC calls

jwayne commented 10 months ago

When running the OpenTelemetry metrics/trace exporters in a Flask service on Google Cloud Run, I've been getting a sizable volume of errors (~60/day) that look like the following:

ERROR:opentelemetry.exporter.cloud_monitoring:Error while writing to Cloud Monitoring
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 75, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 1161, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 1004, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.DEADLINE_EXCEEDED
    details = "Deadline Exceeded"
    debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2023-12-20T00:51:23.257021748+00:00", grpc_status:4}"

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/opentelemetry/exporter/cloud_monitoring/__init__.py", line 361, in export
    self._batch_write(all_series)
  File "/usr/local/lib/python3.10/site-packages/opentelemetry/exporter/cloud_monitoring/__init__.py", line 145, in _batch_write
    self.client.create_time_series(
  File "/usr/local/lib/python3.10/site-packages/google/cloud/monitoring_v3/services/metric_service/client.py", line 1452, in create_time_series
    rpc(
  File "/usr/local/lib/python3.10/site-packages/google/api_core/gapic_v1/method.py", line 131, in __call__
    return wrapped_func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/google/api_core/timeout.py", line 120, in func_with_timeout
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/google/api_core/grpc_helpers.py", line 77, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.DeadlineExceeded: 504 Deadline Exceeded

It looks like these are the result of gRPC calls that timed out (which were triggered by the Cloud Monitoring metrics or trace exporter). FWIW, it's odd that the timeout is being hit in the first place, since I'm running this in GCP.

Two observations:

1. It's been hard to debug which gRPC call is failing, or what the offending timeout is, because the log message doesn't explain either. Could we make the error message more descriptive, perhaps including more context about the failing gRPC call?
2. It appears that many of these timeouts are hardcoded here, suggesting this isn't user error since the timeouts aren't configurable by the user.
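
To illustrate the kind of context that would help with the first observation, here is a minimal sketch of a wrapper that logs the method name, the configured timeout, and the elapsed time when a call fails. Everything here is hypothetical: `call_with_context`, the logger name, and `fake_create_time_series` (a stand-in that always times out) are made up for illustration and are not part of the exporter or the Cloud Monitoring client.

```python
import logging
import time

logger = logging.getLogger("export_debug")  # hypothetical logger name


def call_with_context(rpc, *, method_name, timeout_s, **kwargs):
    """Call rpc(timeout=timeout_s, **kwargs) and log context if it fails.

    The original exception is re-raised unchanged so callers see the
    same error type they would have seen without the wrapper.
    """
    start = time.monotonic()
    try:
        return rpc(timeout=timeout_s, **kwargs)
    except Exception as exc:
        elapsed = time.monotonic() - start
        logger.error(
            "%s failed after %.3fs (configured timeout %.1fs): %s: %s",
            method_name, elapsed, timeout_s, type(exc).__name__, exc,
        )
        raise


def fake_create_time_series(timeout, **kwargs):
    # Stand-in for a real client call (assumption); it always times out.
    raise TimeoutError("Deadline Exceeded")
```

With something like this around the export call, the log line would say which RPC failed and how long it ran relative to its timeout, which is the information missing from the traceback above.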

Glad to hear any suggestions.

aabmass commented 9 months ago

Hi @jwayne, we've seen timeouts like this in Cloud Run when containers have their CPU throttled in the middle of an export. When the CPU comes back, enough time has passed that the request times out. Using CPU Always Allocated would probably fix the issue, but I understand if you don't want to do this.
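
As a rough illustration of that failure mode, here is a toy model using a manually advanced clock. It is only a sketch: `FakeClock` and `export_once` are hypothetical names, and the 10 second timeout and 0.2 second work time are made-up numbers, not the exporter's actual values. The point is that a deadline is a fixed point in time set when the call starts, so time spent with no CPU counts against it.

```python
class FakeClock:
    """Manually advanced clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


def export_once(clock, timeout_s, work_s, frozen_s):
    """Return True if the simulated export finishes before its deadline.

    The deadline is fixed when the call starts, so time spent frozen
    (CPU throttled) is charged against it just like real work.
    """
    deadline = clock.now + timeout_s
    clock.advance(work_s / 2)  # begin sending the request
    clock.advance(frozen_s)    # container is throttled mid-export
    clock.advance(work_s / 2)  # finish once CPU is back
    return clock.now <= deadline


# Hypothetical numbers: 10 s timeout, 0.2 s of actual work.
ok_without_throttle = export_once(FakeClock(), 10.0, 0.2, 0.0)
ok_with_throttle = export_once(FakeClock(), 10.0, 0.2, 30.0)
```

In this model the export succeeds when the container is never frozen and misses its deadline when it is frozen for longer than the timeout, even though the actual work is tiny in both cases.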

You mentioned this is ~60/day. Do you know the overall error rate for DEADLINE_EXCEEDED in your service? You can also try using an OpenTelemetry collector for traces, but it probably won't completely fix the issue.

punya commented 5 months ago

Closing this because the customer hasn't responded in a few months.

GoogleCloudPlatform / opentelemetry-operations-python

Deadline exceeded in gRPC calls #304