dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

Readonly webserver not recovering from reloading temp. unavailable code location #22511

Open alexknorr opened 3 months ago

alexknorr commented 3 months ago

Dagster version

1.7.9

What's the issue?

When a code location container (pod) is updated through a rolling update (a new pod is spun up, then the old one is removed), a dagster-webserver started with the --read-only flag in a separate pod receives a LocationStateChangeEventType.LOCATION_UPDATED event and tries to reload. If, as I suspect, the code location is briefly unavailable under the old gRPC connection, the reload fails and the webserver does not recover (it does not retry). In that case dagster-webserver has to be restarted manually.

2024-06-12 18:03:34 +0000 - dagster-webserver - INFO - Received LocationStateChangeEventType.LOCATION_UPDATED event for location computation, refreshing
/dagster/venv/lib/python3.10/site-packages/dagster/_core/workspace/context.py:641: UserWarning: Error loading repository location computation:dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE

Stack Trace:
  File "/dagster/venv/lib/python3.10/site-packages/dagster/_core/workspace/context.py", line 636, in _load_location
    else origin.create_location(self.instance)
  File "/dagster/venv/lib/python3.10/site-packages/dagster/_core/remote_representation/origin.py", line 364, in create_location
    return GrpcServerCodeLocation(self, instance=instance)
  File "/dagster/venv/lib/python3.10/site-packages/dagster/_core/remote_representation/code_location.py", line 643, in __init__
    self.server_id = server_id if server_id else sync_get_server_id(self.client)
  File "/dagster/venv/lib/python3.10/site-packages/dagster/_api/get_server_id.py", line 15, in sync_get_server_id
    result = check.inst(api_client.get_server_id(), (str, SerializableErrorInfo))
  File "/dagster/venv/lib/python3.10/site-packages/dagster/_grpc/client.py", line 233, in get_server_id
    res = self._query("GetServerId", api_pb2.Empty, timeout=timeout)
  File "/dagster/venv/lib/python3.10/site-packages/dagster/_grpc/client.py", line 173, in _query
    self._raise_grpc_exception(
  File "/dagster/venv/lib/python3.10/site-packages/dagster/_grpc/client.py", line 156, in _raise_grpc_exception
    raise DagsterUserCodeUnreachableError(

The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.25.214.184:4000: Failed to connect to remote host: Connection refused"
    debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:10.25.214.184:4000: Failed to connect to remote host: Connection refused {created_time:"2024-06-12T18:03:34.997906873+00:00", grpc_status:14}"
>

Stack Trace:
  File "/dagster/venv/lib/python3.10/site-packages/dagster/_grpc/client.py", line 171, in _query
    return self._get_response(method, request=request_type(**kwargs), timeout=timeout)
  File "/dagster/venv/lib/python3.10/site-packages/dagster/_grpc/client.py", line 141, in _get_response
    return getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/dagster/venv/lib/python3.10/site-packages/grpc/_channel.py", line 1161, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/dagster/venv/lib/python3.10/site-packages/grpc/_channel.py", line 1004, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable

  warnings.warn(f"Error loading repository location {location_name}:{error.to_string()}")

What did you expect to happen?

The webserver should recover from temporarily unavailable code locations in read-only mode.
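
As a rough illustration of the recovery behavior described here, the reload could be retried with exponential backoff while the gRPC server is unreachable. This is only a sketch and not Dagster's actual implementation: `reload_with_retries`, `LocationUnreachableError`, and the `load_location` callable are hypothetical names standing in for the webserver's reload path and for `DagsterUserCodeUnreachableError`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class LocationUnreachableError(Exception):
    """Hypothetical stand-in for DagsterUserCodeUnreachableError."""


def reload_with_retries(
    load_location: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call load_location, retrying with exponential backoff while it is unreachable.

    load_location is a hypothetical callable wrapping the code location reload.
    The last error is re-raised once max_attempts is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return load_location()
        except LocationUnreachableError:
            if attempt == max_attempts:
                raise
            # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

With something like this in place, a pod that refuses connections for a few seconds during a rolling update would be picked up again on a later attempt instead of leaving the location in a failed state until a manual restart.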

How to reproduce?

No response

Deployment type

Other

Deployment details

Custom k8s deployment on OpenShift with dagster-webserver, daemon, and code locations in separate pods.

Additional information

No response

alexknorr commented 3 months ago

Could it be related to browser caching? I had a case where a newly deployed code location was marked as failed in the read-only UI, and reloading the page did not change anything. After clearing the edge cache and reloading, the failure status was gone and the UI showed the newest image version.