dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

[docs] - include instructions for getting the user code deployments working with istio #7447

Open dagsir[bot] opened 2 years ago

dagsir[bot] commented 2 years ago

Summary

Add a guide that explains how to successfully get user code deployments working with Istio.


Dagster Documentation Gap

This issue was generated from the slack conversation at: https://dagster.slack.com/archives/C01U954MEER/p1648700166288679?thread_ts=1648700166.288679&cid=C01U954MEER

Conversation excerpt

U03A5551T16: Hello Team, I'm looking for a guide on how to troubleshoot the connection to a deployment. I have dagster deployed on an AWS K8s (EKS) cluster. Everything worked 2 weeks ago. However, last week the cluster was redeployed, and so were my helm charts. After that, I got this error message in the Dagit GUI:

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "upstream connect error or disconnect/reset before headers. reset reason: protocol error"
    debug_error_string = "{"created":"@1648544039.239355713","description":"Error received from peer ipv4:172.20.8.123:3030","file":"src/core/lib/surface/call.cc","file_line":903,"grpc_message":"upstream connect error or disconnect/reset before headers. reset reason: protocol error","grpc_status":14}"
>

From within the dagit pod, I could telnet to the deployment k8s service (telnet my_user_app_deployment_name 3030 --> connected). May I get help on this issue? Thanks a lot.

U016C4E5CP8: Hi, if you go to the Workspace tab in dagit and press the reload button, do you still get the error? Do you have a full stack trace for the error?

U03A5551T16: Hi Daniel, sorry for the late response (I'm in a bad timezone). Reloading doesn't help. This is all I could see in the logs on the dagit pod:

/usr/local/lib/python3.7/site-packages/dagster/core/workspace/context.py:560: UserWarning: Error loading repository location user-code-example:dagster.core.errors.DagsterUserCodeUnreachableError: Could not reach user code server

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/core/workspace/context.py", line 555, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/workspace/context.py", line 481, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/core/host_representation/origin.py", line 291, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/host_representation/repository_location.py", line 526, in __init__
    list_repositories_response = sync_list_repositories_grpc(self.client)
  File "/usr/local/lib/python3.7/site-packages/dagster/api/list_repositories.py", line 19, in sync_list_repositories_grpc
    api_client.list_repositories(),
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 164, in list_repositories
    res = self._query("ListRepositories", api_pb2.ListRepositoriesRequest)
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 110, in _query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e

The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "upstream connect error or disconnect/reset before headers. reset reason: protocol error"
    debug_error_string = "{"created":"@1648716422.905634674","description":"Error received from peer ipv4:172.20.39.151:3030","file":"src/core/lib/surface/call.cc","file_line":903,"grpc_message":"upstream connect error or disconnect/reset before headers. reset reason: protocol error","grpc_status":14}"
>

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 107, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)

  location_name=location_name, error_string=error.to_string()

U016C4E5CP8: if you redeploy the helm chart, does it still happen?

U016C4E5CP8: Would it be possible to share or DM your values.yaml file? This is using the dagster helm chart?

U016C4E5CP8: Lastly, what version were you on before, and what version are you on now that it is not working?

U03A5551T16: yes, I tried to uninstall both helm charts and reinstall them - no luck

U03A5551T16: let me DM the values files

U016C4E5CP8: Are there any clues in the logs for the user code deployment pod?

U03A5551T16: no, there are no logs in that pod

2022-03-30 01:48:43 +0000 - dagster.code_server - INFO - Started Dagster code server for package analytx on port 3030 in process 1

U03A5551T16: is there any python code that I can run on the dagit pod to debug the connection?

U03A5551T16: I'm using telnet to test only

U016C4E5CP8: There's a dagster api grpc-health-check command that you could run on the dagit pod - e.g. dagster api grpc-health-check -p 4000. It will raise an error if there's an issue connecting to the gRPC server on that port, and return cleanly if it's able to connect

U016C4E5CP8: curious what that command outputs on the dagit pod

U03A5551T16:

root@dagster-dagit-76d57b9598-vvdpn:/# nslookup
> k8s-swyftx-user-app
Server:     172.20.0.10
Address:    172.20.0.10#53

Name:   k8s-swyftx-user-app.dagster.svc.cluster.local
Address: 172.20.39.151
> 
root@dagster-dagit-76d57b9598-vvdpn:/# telnet k8s-swyftx-user-app 3030
Trying 172.20.39.151...
Connected to k8s-swyftx-user-app.dagster.svc.cluster.local.
Escape character is '^]'.

^C
Connection closed by foreign host.
root@dagster-dagit-76d57b9598-vvdpn:/# ^C
root@dagster-dagit-76d57b9598-vvdpn:/# dagster api grpc-health-check -h k8s-swyftx-user-app -p 3030
<_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "upstream connect error or disconnect/reset before headers. reset reason: protocol error"
    debug_error_string = "{"created":"@1648761065.717335601","description":"Error received from peer ipv4:172.20.39.151:3030","file":"src/core/lib/surface/call.cc","file_line":903,"grpc_message":"upstream connect error or disconnect/reset before headers. reset reason: protocol error","grpc_status":14}"
>
root@dagster-dagit-76d57b9598-vvdpn:/#

U016C4E5CP8: And what exactly changed between when it was working before and when it stopped working? Any details you can provide would help

U016C4E5CP8: specific versions, etc.

U03A5551T16: it was 0.14.3 when it was running last week

U03A5551T16: when I reinstalled the helm charts, it's 0.14.6

U03A5551T16: our platform team (who manages the EKS cluster) said they installed istio

U016C4E5CP8: did any other changes happen in the cluster at the same time as the upgrade - are the user code deployments and dagit installed in the same cluster / expected to have network access to each other?

U016C4E5CP8: installing istio seems like it could be related for sure

U03A5551T16: I'm not aware of any other changes. The services are expected to have network access to each other.

U016C4E5CP8: Here's a thread with some other folks who ran into connection errors related to istio - https://dagster.slack.com/archives/C01U954MEER/p1634042336128100

U03A5551T16: I also suspect it was Istio. However, telnet shows that the networks are connected

U016C4E5CP8: I'm looking through this issue which seems possibly relevant https://github.com/istio/istio/issues/27513

U016C4E5CP8: but it doesn't have an obvious resolution

U03A5551T16: thanks Dan

U03A5551T16: let me come back to talk to the EKS team

U03A5551T16: if you have some python code that I could use to mimic grpc-health-check, that would be great

U016C4E5CP8: Here's the implementation of the health check command: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/cli/api.py#L615-L626

U03A5551T16: Thanks!

U03A5551T16: <@U016C4E5CP8> I got a bit of progress with this issue. According to our K8s team, the service's port name is incorrect - the correct name should be grpc instead of http. https://github.com/dagster-io/dagster/blob/3b55c4e864775b7a70ed8ff539629317a1202505/helm/dagster/charts/dagster-user-deployments/templates/service-user.yaml#L18-L23 After changing this, my dagit pod could connect to the dagster-app service. I'm not sure whether this service port name is something agreed globally, or whether it's only an Istio convention.

U03A5551T16: Anyway, after that issue was cleared, I stumbled upon the same issue mentioned in this thread https://dagster.slack.com/archives/C01U954MEER/p1635400200206600

U016C4E5CP8: Ah great! We can include that in the docs for future people who run into this

U016C4E5CP8: <@U018K0G2Y85> docs include instructions for getting the user code deployments working with istio
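For future readers, the port-name fix described in the excerpt amounts to renaming the gRPC port in the user code deployment's Service. A minimal sketch with illustrative names (not the exact chart template linked above): Istio's protocol selection is driven by the port name, so a port named grpc (or prefixed with grpc-) is proxied as gRPC over HTTP/2 rather than as HTTP/1.1:

apiVersion: v1
kind: Service
metadata:
  name: k8s-example-user-app    # illustrative name
spec:
  ports:
    - name: grpc                # was "http"; Istio infers the protocol from this name
      port: 3030
      targetPort: 3030
      protocol: TCP

On newer Kubernetes and Istio versions, setting appProtocol: grpc on the port entry is an alternative, more explicit way to declare the protocol.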


Message from the maintainers:

Are you looking for the same documentation content? Give it a :thumbsup:. We factor engagement into prioritization.

erinkcochran87 commented 1 year ago

@gibsondan Do you know if this issue is still relevant? I think you helped this user originally

cyberjar09 commented 5 months ago

Hi, what was the resolution here? I don't see anything relevant in the docs or Slack.

cberge908 commented 5 months ago

Hey folks, I guess this is still relevant. We're also facing a similar situation where we can't connect to the code server. The name of the service port is correct (grpc), so this is not the solution.

Just to prove that it's istio/envoy, we disabled the istio injection for dagster - the webserver was able to connect to the code server without any issues.
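For context, taking a single workload out of the mesh (as described above) is typically done with the standard sidecar.istio.io/inject annotation on the pod template. A minimal sketch with an illustrative deployment name - with the dagster Helm chart, this annotation would go wherever the chart exposes pod annotations:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-code-server      # illustrative name
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"    # opt this pod out of sidecar injection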

When trying the grpc connection check (after enabling the istio injection again) we're seeing this:

root@dagster-dagster-webserver-6bf7db84dc-q7tmj:/# dagster api grpc-health-check -p 4000 -h custom-code-server
<_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNKNOWN
    details = "Missing :te header"
    debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-04-08T12:47:34.180461915+00:00", grpc_status:2, grpc_message:"Missing :te header"}"
>
Unable to connect to gRPC server: 0

From searching the web, it might have something to do with CORS settings. Has anyone else seen this Missing :te header message?
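One hedged workaround consistent with the observation above (the connection works as soon as the sidecar is out of the path) is to keep injection enabled but exclude only the gRPC port from inbound interception, so the raw HTTP/2 connection reaches the code server without Envoy rewriting it. A sketch assuming port 4000 and an illustrative deployment name:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-code-server      # illustrative name
spec:
  template:
    metadata:
      annotations:
        # traffic to port 4000 bypasses the Envoy sidecar entirely
        traffic.sidecar.istio.io/excludeInboundPorts: "4000"

With mesh-wide STRICT mTLS (mentioned in the edit below), the bypassed port would likely also need a port-level PeerAuthentication exception, and possibly a DestinationRule disabling TLS for that port, so that client-side mTLS expectations match the now-plaintext server port.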

/edit: Also, just for the bigger picture: we have enforced mTLS PeerAuthentication in our whole service mesh. Besides this we don't have any special settings (only a minimum TLS version of v1.3).

Cheers, @cberge908