Open dagsir[bot] opened 2 years ago
@gibsondan D'you know if this guy is still relevant? I think you helped this user originally
Hi, what was the resolution here? I don't see anything relevant in the docs or on Slack.
Hey folks, I guess this is still relevant. We're also facing a similar situation where we can't connect to the code server. The name of the service port is correct (grpc), so this is not the solution.
Just to prove that it's Istio/Envoy, we disabled Istio injection for dagster, and the webserver was able to connect to the code server without any issues.
When trying the gRPC connection check (after enabling Istio injection again), we see this:
root@dagster-dagster-webserver-6bf7db84dc-q7tmj:/# dagster api grpc-health-check -p 4000 -h custom-code-server
<_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Missing :te header"
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-04-08T12:47:34.180461915+00:00", grpc_status:2, grpc_message:"Missing :te header"}"
>
Unable to connect to gRPC server: 0
From what I found online, it might have something to do with the CORS settings. Has anyone else seen this "Missing :te header" message?
/edit: Also, for the bigger picture, we enforce mTLS PeerAuthentication across our whole service mesh. Beyond that we have no special settings (only a minimum TLS version of v1.3).
Cheers, @cberge908
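For anyone who wants to script the check above from Python, here is a minimal sketch that shells out to the same CLI command. It only uses the `-p` and `-h` flags shown in this thread. The function names are hypothetical, and it assumes the `dagster` CLI is on the PATH of the pod and exits with a non-zero status when the connection fails.

```python
import subprocess
from typing import List


def health_check_command(host: str, port: int) -> List[str]:
    """Build the argv for the CLI health check quoted in this thread."""
    return ["dagster", "api", "grpc-health-check", "-p", str(port), "-h", host]


def run_health_check(host: str, port: int, timeout: float = 30.0) -> bool:
    """Run the CLI check and return True only if it exits with status 0.

    Assumption: a failed connection produces a non-zero exit status.
    """
    try:
        result = subprocess.run(
            health_check_command(host, port),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        # The CLI is missing from PATH, or the call hung past the timeout.
        print(f"health check could not complete: {exc}")
        return False
    if result.returncode != 0:
        # Surface whatever the CLI printed, e.g. the "Missing :te header" error.
        print(result.stdout.strip() or result.stderr.strip())
    return result.returncode == 0
```

Calling `run_health_check("custom-code-server", 4000)` from the webserver pod once with the sidecar injected and once without makes the comparison described above repeatable.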
Summary
Add a guide that explains how to successfully get user code deployments working with Istio.
Dagster Documentation Gap
This issue was generated from the slack conversation at: https://dagster.slack.com/archives/C01U954MEER/p1648700166288679?thread_ts=1648700166.288679&cid=C01U954MEER
Conversation excerpt
U03A5551T16: Hello Team, I’m looking for a guide on how to troubleshoot the connection to a deployment. I have dagster deployed on an AWS K8s (EKS) cluster. Everything worked 2 weeks ago. However, last week the cluster was redeployed, and so were my helm charts. After that, I got this error message on the Dagit GUI:
From within the dagit pod, I could telnet to the deployment k8s service (telnet my_user_app_deployment_name 3030 --> connected). May I get help on this issue? Thanks a lot.
U016C4E5CP8: Hi, if you go to the Workspace tab in dagit and press the reload button, do you still get the error? Do you have a full stack trace for the error?
U03A5551T16: Hi Daniel, sorry for the late response (I'm in a bad timezone). Reloading doesn't help. This is all I could see from the logs on the dagit pod:
U016C4E5CP8: If you redeploy the helm chart, does it still happen?
U016C4E5CP8: Would it be possible to share or DM your values.yaml file? This is using the dagster helm chart?
U016C4E5CP8: Lastly, what version were you on before, when it was working, and what version are you on now, when it is not?
U03A5551T16: yes, I tried to uninstall both helm charts and reinstall them - no luck
U03A5551T16: let me DM the values files
U016C4E5CP8: Are there any clues in the logs for the user code deployment pod?
U03A5551T16: no, there are no logs in that pod
U03A5551T16: is there any python code that I can run on the dagit pod to debug the connection?
U03A5551T16: I'm using telnet to test only
U016C4E5CP8: There's a dagster api grpc-health-check command that you could run on the dagit pod - e.g. dagster api grpc-health-check -p 4000. It will raise an error if there's an issue connecting to the gRPC server on that port, and return cleanly if it's able to connect
U016C4E5CP8: curious what that command outputs on the dagit pod
U03A5551T16:
U016C4E5CP8: And what exactly changed between when it was working before and when it stopped working? any details you can provide would help
U016C4E5CP8: specific versions, etc.
U03A5551T16: it was 0.14.3 when it was running last week
U03A5551T16: when I reinstalled the helm charts, it was 0.14.6
U03A5551T16: our platform team (who manages the EKS cluster) said they installed istio
U016C4E5CP8: did any other changes happen in the cluster at the same time as the upgrade - are the user code deployments and dagit installed in the same cluster / expected to have network access to each other?
U016C4E5CP8: installing istio seems like it could be related for sure
U03A5551T16: I'm not aware of any other changes. The services are expected to have network access to each other.
U016C4E5CP8: Here's a thread with some other folks who ran into connection errors related to istio - https://dagster.slack.com/archives/C01U954MEER/p1634042336128100
U03A5551T16: I also suspect it was Istio. However, telnet shows that the networks are connected
U016C4E5CP8: I'm looking through this issue which seems possibly relevant https://github.com/istio/istio/issues/27513
U016C4E5CP8: but doesn't have an obvious resolution
U03A5551T16: thanks Dan
U03A5551T16: let me go back and talk to the EKS team
U03A5551T16: if you have some python code that I could use to mimic grpc-health-check, that would be great
U016C4E5CP8: Here's the implementation of the health check command: https://github.com/dagster-io/dagster/blob/master/python_modules/dagster/dagster/cli/api.py#L615-L626
U03A5551T16: Thanks!
U03A5551T16: <@U016C4E5CP8> I made a bit of progress with this issue. According to our K8s team, the service's port name is incorrect (?). The correct name should be grpc instead of http. https://github.com/dagster-io/dagster/blob/3b55c4e864775b7a70ed8ff539629317a1202505/helm/dagster/charts/dagster-user-deployments/templates/service-user.yaml#L18-L23 After changing this, my dagit pod could connect to the dagster-app service. I'm not sure whether this service port name is something agreed globally, or whether it's only an Istio convention.
U03A5551T16: Anyway, after that issue was cleared, I stumbled upon the same issue mentioned in this thread https://dagster.slack.com/archives/C01U954MEER/p1635400200206600
U016C4E5CP8: Ah great! We can include that in the docs for future people who run into this
U016C4E5CP8: <@U018K0G2Y85> docs include instructions for getting the user code deployments working with istio
Message from the maintainers:
Are you looking for the same documentation content? Give it a :thumbsup:. We factor engagement into prioritization.
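On the port-name fix reported in the conversation excerpt: Istio can pick a port's protocol from the Service port name, using the form `<protocol>[-<suffix>]`, which is why renaming the port from http to grpc changed the behavior. The sketch below is a simplified illustration of that naming convention, not Istio's actual implementation. The function names and the protocol list are assumptions made for the example, and the list is not exhaustive.

```python
from typing import Optional

# Illustrative subset of protocol prefixes; not Istio's full list.
KNOWN_PROTOCOLS = {"grpc", "http", "http2", "https", "tcp", "tls"}


def protocol_from_port_name(port_name: str) -> Optional[str]:
    """Guess the protocol selected by a Service port name.

    Simplified reading of the `<protocol>[-<suffix>]` convention: the text
    before the first hyphen is treated as the protocol.
    """
    name = port_name.lower()
    # "grpc-web" is itself a protocol name, so check it before splitting.
    if name.startswith("grpc-web"):
        return "grpc-web"
    prefix = name.split("-", 1)[0]
    return prefix if prefix in KNOWN_PROTOCOLS else None


def is_grpc_port(port_name: str) -> bool:
    """True if the port name would be read as plain gRPC under this sketch."""
    return protocol_from_port_name(port_name) == "grpc"
```

Under this reading, a port named `grpc` or `grpc-dagster` is treated as gRPC, while one named `http` is treated as HTTP, which matches the symptom described in the excerpt. As the first comment in this thread notes, a correct port name alone did not resolve every report.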