flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0

[BUG] flyte-binary chart: GRPC routes with traefik not working as expected #4962

Closed ldunkum closed 5 months ago

ldunkum commented 8 months ago

Describe the bug

We're using traefik as an Ingress controller in our EKS cluster and are deploying the flyte-binary chart. The flyte-binary-grpc service has the annotation traefik.ingress.kubernetes.io/service.serversscheme: h2c, which according to traefik docs should be fine to serve GRPC traffic.

Testing the GRPC endpoints with curl works fine, e.g.:
curl -v -X POST --http2 'https://flyte.example.com/grpc.health.v1.Health' -d "" -H 'Content-Type: application/grpc' -H 'Accept: application/grpc'

< HTTP/2 200
< content-type: application/grpc

Trying to run a workflow fails with the following error message:

> pyflyte -v run -r  test2/workflows/example.py wf

E0227 10:25:36.676138107  567282 ssl_transport_security.cc:1511]       Handshake failed with fatal error SSL_ERROR_SSL: error:10000410:SSL routines:OPENSSL_internal:SSLV3_ALERT_HANDSHAKE_FAILURE.
E0227 10:25:36.703445362  567282 ssl_transport_security.cc:1511]       Handshake failed with fatal error SSL_ERROR_SSL: error:10000410:SSL routines:OPENSSL_internal:SSLV3_ALERT_HANDSHAKE_FAILURE.
E0227 10:25:36.728929063  567282 ssl_transport_security.cc:1511]       Handshake failed with fatal error SSL_ERROR_SSL: error:10000410:SSL routines:OPENSSL_internal:SSLV3_ALERT_HANDSHAKE_FAILURE.
Verbose mode on
╭───────────────────────────────────────────────────────────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────────────────────────────────────────────────────────╮
│ /home/user/.local/pipx/venvs/flytekit/lib/python3.11/site-packages/grpc/_interceptor.py:315 in continuation                                                                                                        │
│                                                                                                                                                                                                                     │
│ ❱ 315 │   │   │   │   response, call = self._thunk(new_method).with_call(                                                                                                                                           │
│                                                                                                                                                                                                                     │
│ /home/user/.local/pipx/venvs/flytekit/lib/python3.11/site-packages/grpc/_interceptor.py:343 in with_call                                                                                                           │
│                                                                                                                                                                                                                     │
│ ❱ 343 │   │   return self._with_call(                                                                                                                                                                               │
│                                                                                                                                                                                                                     │
│ /home/user/.local/pipx/venvs/flytekit/lib/python3.11/site-packages/grpc/_interceptor.py:332 in _with_call                                                                                                          │
│                                                                                                                                                                                                                     │
│ ❱ 332 │   │   return call.result(), call                                                                                                                                                                            │
│                                                                                                                                                                                                                     │
│ /home/user/.local/pipx/venvs/flytekit/lib/python3.11/site-packages/grpc/_channel.py:437 in result                                                                                                                  │
│                                                                                                                                                                                                                     │
│ ❱  437 │   │   raise self                                                                                                                                                                                           │
│                                                                                                                                                                                                                     │
│ /home/user/.local/pipx/venvs/flytekit/lib/python3.11/site-packages/grpc/_interceptor.py:315 in continuation                                                                                                        │
│                                                                                                                                                                                                                     │
│ ❱ 315 │   │   │   │   response, call = self._thunk(new_method).with_call(                                                                                                                                           │
│                                                                                                                                                                                                                     │
│ /home/user/.local/pipx/venvs/flytekit/lib/python3.11/site-packages/grpc/_interceptor.py:343 in with_call                                                                                                           │
│                                                                                                                                                                                                                     │
│ ❱ 343 │   │   return self._with_call(                                                                                                                                                                               │
│                                                                                                                                                                                                                     │
│ /home/user/.local/pipx/venvs/flytekit/lib/python3.11/site-packages/grpc/_interceptor.py:332 in _with_call                                                                                                          │
│                                                                                                                                                                                                                     │
│ ❱ 332 │   │   return call.result(), call                                                                                                                                                                            │
│                                                                                                                                                                                                                     │
│ /home/user/.local/pipx/venvs/flytekit/lib/python3.11/site-packages/grpc/_channel.py:437 in result                                                                                                                  │
│                                                                                                                                                                                                                     │
│ ❱  437 │   │   raise self                                                                                                                                                                                           │
│                                                                                                                                                                                                                     │
│ /home/user/.local/pipx/venvs/flytekit/lib/python3.11/site-packages/grpc/_interceptor.py:315 in continuation                                                                                                        │
│                                                                                                                                                                                                                     │
│ ❱ 315 │   │   │   │   response, call = self._thunk(new_method).with_call(                                                                                                                                           │
│                                                                                                                                                                                                                     │
│ /home/user/.local/pipx/venvs/flytekit/lib/python3.11/site-packages/grpc/_channel.py:1177 in with_call                                                                                                              │
│                                                                                                                                                                                                                     │
│ ❱ 1177 │   │   return _end_unary_response_blocking(state, call, True, None)                                                                                                                                         │
│                                                                                                                                                                                                                     │
│ /home/user/.local/pipx/venvs/flytekit/lib/python3.11/site-packages/grpc/_channel.py:1003 in _end_unary_response_blocking                                                                                           │
│                                                                                                                                                                                                                     │
│ ❱ 1003 │   │   raise _InactiveRpcError(state)  # pytype: disable=not-instantiable                                                                                                                                   │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:1.2.3.4:443: Ssl handshake failed: SSL_ERROR_SSL: error:10000410:SSL routines:OPENSSL_internal:SSLV3_ALERT_HANDSHAKE_FAILURE"
        debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-02-27T10:25:36.729366014+01:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:1.2.3.4:443: Ssl handshake failed: SSL_ERROR_SSL: error:10000410:SSL routines:OPENSSL_internal:SSLV3_ALERT_HANDSHAKE_FAILURE"}"

The above exception was the direct cause of the following exception:

╭───────────────────────────────────────────────────────────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────────────────────────────────────────────────────────╮
│ /home/user/.local/bin/pyflyte:8 in <module>                                                                                                                                                                        │
│                                                                                                                                                                                                                     │
│ ❱ 8 │   sys.exit(main())                                                                                                                                                                                            │
│                                                                                                                                                                                                                     │
│ /home/user/.local/pipx/venvs/flytekit/lib/python3.11/site-packages/click/core.py:1157 in __call__                                                                                                                  │
│                                                                                                                                                                                                                     │
│ ❱ 1157 │   │   return self.main(*args, **kwargs)                                                                                                                                                                    │
│                                                                                                                                                                                                                     │
│ /home/user/.local/pipx/venvs/flytekit/lib/python3.11/site-packages/rich_click/rich_command.py:126 in main                                                                                                          │
│                                                                                                                                                                                                                     │
│ ❱ 126 │   │   │   │   │   rv = self.invoke(ctx)                                                                                                                                                                     │
│                                                                                                                                                                                                                     │
│ /home/user/.local/pipx/venvs/flytekit/lib/python3.11/site-packages/flytekit/clis/sdk_in_container/utils.py:144 in invoke                                                                                           │
│                                                                                                                                                                                                                     │
│ ❱ 144 │   │   │   │   │   raise e.with_traceback(None)                                                                                                                                                              │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
FlyteSystemException

However, I can create a project from the CLI:

flytectl create project --file new-test-proj.yaml

{"json":{"src":"client.go:63"},"level":"info","msg":"Initialized Admin client","ts":"2024-02-27T09:46:53+01:00"}
{"json":{"src":"auth_interceptor.go:86"},"level":"debug","msg":"Request failed due to [rpc error: code = Unauthenticated desc = token parse error [JWT_VERIFICATION_FAILED] Could not retrieve id token from metadata, caused by: rpc error: code = Unauthenticated desc = Request unauthenticated with IDToken]. If it's an unauthenticated error, we will attempt to establish an authenticated context.","ts":"2024-02-27T09:46:53+01:00"}
{"json":{"src":"auth_interceptor.go:91"},"level":"debug","msg":"Request failed due to [Unauthenticated]. Attempting to establish an authenticated connection and trying again.","ts":"2024-02-27T09:46:53+01:00"}
{"json":{"src":"base_token_orchestrator.go:63"},"level":"info","msg":"found the token in the cache","ts":"2024-02-27T09:46:53+01:00"}
project created successfully.

Expected behavior

I'd expect the CLI calls to work without error.

Additional context to reproduce

#config.yaml
admin:
  endpoint: dns:///flyte.example.com
  authType: Pkce # default 
  insecure: false
  insecureSkipVerify: true

dosubot[bot] commented 8 months ago

🤖

Based on the information you've provided, the handshake failure you see when running a workflow with the flyte-binary chart behind Traefik on EKS could be because gRPC requires HTTP/2, and not all Ingress controllers support HTTP/2 for backend services.

In the provided context, the Ingress configuration for Flyte is set up to handle both HTTP and gRPC routes. However, not all Ingress controllers can handle both HTTP 1 and gRPC with a single Ingress because gRPC can only be enabled on the Ingress object, not on backend services. This is mentioned in the comments of the provided context:

# Certain ingress controllers like nginx cannot serve HTTP 1 and GRPC with a single ingress because GRPC can only
# enabled on the ingress object, not on backend services (GRPC annotation is set on the ingress, not on the services).

Traefik, the Ingress controller you're using, supports HTTP/2, but it needs to be enabled explicitly. If it's not enabled, you might experience handshake failure errors when trying to run gRPC workflows.

To resolve this issue, you need to ensure that HTTP/2 is enabled in your Traefik configuration. If you're using Traefik v2.x, you can enable HTTP/2 by adding the following to your Traefik configuration:

entryPoints:
  web:
    address: ":80"
    http:
      http2: true

Please adjust the configuration according to your setup. If you're using a different version of Traefik or have a different setup, the configuration might be different. Please refer to the Traefik documentation for more details.

Regarding the insecureSkipVerify configuration in the Flyte CLI, it is a boolean flag that controls whether a client verifies the server's certificate chain and host name during a gRPC call. If insecureSkipVerify is set to true, the client will not verify the server's certificate chain and host name, which can be insecure and is not recommended for production use cases. This configuration is particularly useful in development or testing environments where you might be using self-signed certificates. In such cases, you can set insecureSkipVerify to true to bypass the SSL certificate verification.

I hope this helps! If you have any other questions or need further clarification, feel free to ask.

davidmirror-ops commented 7 months ago

I'm not sure Flyte supports Traefik, especially due to the absence of some of the annotations required by Flyte. I think @flixr managed to make it work, but not with `flyte-binary`?

ldunkum commented 7 months ago

Thanks for the reply @davidmirror-ops!
I was under the impression that some folks managed to get a deployment with Traefik working in the Slack channel.

Can you point to the annotations that Flyte requires that aren't supported by Traefik?

flixr commented 7 months ago

Hey @ldunkum, we have flyte (flyte-core) running with traefik. When I set it up, I used the ingress rules that were created, removed the grpc stuff and created an IngressRoute for the grpc endpoint:

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: flyte-core-grpc
  namespace: flyte
spec:
  entryPoints:
    - web
    - websecure
  routes:
  - kind: Rule
    match: Host(`flyte.k3s.roboception.de`) && (PathPrefix(`/flyteidl.{service:.*}`) || PathPrefix(`/grpc.health.v1.Health`))
    services:
      - kind: Service
        name: flyteadmin
        namespace: flyte
        port: grpc
        scheme: h2c
ldunkum commented 7 months ago

Hey @flixr, thanks for your reply. As far as I can tell, we tested basically the same configuration:

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: flyte-binary-grpc-ingressroute
spec:
  entryPoints:
  - web
  - websecure
  routes:
  - kind: Rule
    match: Host(`flyte.example.com`) && PathPrefix(`/grpc.health.v1.Health/*`)
    services:
    - kind: Service
      name: flyte-flyte-binary-grpc
      namespace: dev-flyte
      port: grpc
      scheme: h2c
      passHostHeader: true #default

As our configuration doesn't work, there is probably something else going on, perhaps our Traefik values are different?

These are (most of) our Traefik values:

```yaml
ports:
  web:
    # enable redirect from http to https on every route
    # https://docs.traefik.io/routing/entrypoints/#redirection
    redirectTo:
      port: websecure
    # Trust our origins regarding X-Forwarded headers
    # yamllint disable-line rule:line-length
    # cf. https://doc.traefik.io/traefik/v2.3/routing/entrypoints/#forwarded-headers
    forwardedHeaders:
      trustedIPs: ["${cidr_range}"]
    proxyProtocol:
      trustedIPs: ["${cidr_range}"]
  websecure:
    # Trust our origins regarding X-Forwarded headers
    # yamllint disable-line rule:line-length
    # cf. https://doc.traefik.io/traefik/v2.3/routing/entrypoints/#forwarded-headers
    forwardedHeaders:
      trustedIPs: ["${cidr_range}"]
    proxyProtocol:
      trustedIPs: ["${cidr_range}"]
    middlewares:
      - system-security-headers@kubernetescrd
# Set secure cipher suites and minimum TLS version
# They also publish a best practices guide here:
# https://github.com/ssllabs/research/wiki/SSL-and-TLS-Deployment-Best-Practices
# You can use Mozilla's config generator as a good starting point:
# https://ssl-config.mozilla.org/#server=traefik&version=2.1.2&config=intermediate&guideline=5.6
tlsOptions:
  default:
    minVersion: VersionTLS12
    cipherSuites:
      - "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256"
      - "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"
      - "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384"
      - "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
      - "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305"
      - "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"
      # These are necessary for Win7 users with IE11
      # when the cert is signed with RSA.
      - "TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA"
      - "TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA"
    curvePreferences:
      - CurveP521
      - CurveP384
    sniStrict: true
```

If you spot something, please let me know, the help would be greatly appreciated!

flixr commented 7 months ago

I'm running traefik 2.9.6 at the moment. It looks like your IngressRoute is incomplete, though; it should also match

PathPrefix(`/flyteidl.{service:.*}`)
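
Combined with the health-check matcher, a single route for the flyte-binary setup discussed here might look like the sketch below. It reuses the names from the earlier comments (`flyte-flyte-binary-grpc`, `dev-flyte`, `flyte.example.com`) and the Traefik v2 rule syntax shown above. It is an illustration, not a configuration that was verified in this thread.

```yaml
# Sketch only: one Traefik v2 IngressRoute covering all Flyte gRPC paths.
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: flyte-binary-grpc-ingressroute
spec:
  entryPoints:
    - web
    - websecure
  routes:
    - kind: Rule
      # Matches every flyteidl.* service plus the gRPC health check.
      match: Host(`flyte.example.com`) && (PathPrefix(`/flyteidl.{service:.*}`) || PathPrefix(`/grpc.health.v1.Health`))
      services:
        - kind: Service
          name: flyte-flyte-binary-grpc
          namespace: dev-flyte
          port: grpc
          scheme: h2c  # plaintext HTTP/2 (h2c) to the backend
```
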
ldunkum commented 7 months ago

We're using traefik v2.11.0, and we hadn't looked much into IngressRoutes before, so we simply used different rules for each path. Your solution is much cleaner, so we'll switch to that.

An example for a path config we used:

  - kind: Rule
    match: Host(`flyte.example.com`) && PathPrefix(`/flyteidl.service.AdminService/*`)
    services:
    - kind: Service
      name: flyte-flyte-binary-grpc
      namespace: dev-flyte
      port: grpc
      scheme: h2c
      passHostHeader: true
davidmirror-ops commented 7 months ago

@ldunkum is Traefik working on your env?

ldunkum commented 7 months ago

Traefik in general is working great: it's been deployed for a few years, and we have multiple ingresses that work flawlessly. It's still not working with Flyte, however. We will look at fixing that during the next two weeks and perhaps try switching to the flyte-core chart.

Jeinhaus commented 6 months ago

We actually got this working thanks to this thread in the Traefik support forums. The cipher suites we had were not compatible with the grpc client flyte uses.

davidmirror-ops commented 6 months ago

@Jeinhaus thanks for confirming. Any chance you could share the final ingress config you used to make it work with Flyte? Just in case others find this thread useful.

Jeinhaus commented 6 months ago

@davidmirror-ops yes. I'll try to get some PRs open for this, because it was not only traefik's tlsOptions but also a missing grpc service in the flyte-core chart. For now, the tlsOptions we used that worked were:

    tlsOptions:
      default:
        minVersion: VersionTLS12
        cipherSuites:
        - "TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA"
        - "TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA"
        - "TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256"
        - "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256"
        - "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384"
        - "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305"
        - "TLS_CHACHA20_POLY1305_SHA256"
        - "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"
        - "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256"
        - "TLS_FALLBACK_SCSV"
        # Important for GRPC, see
        # https://community.traefik.io/t/how-to-disable-two-ciphersuites-and-tls1-1-without-breaking-grpc/17647/5
        - "TLS_AES_128_GCM_SHA256"
        - "TLS_AES_256_GCM_SHA384"
        - "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"
        - "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
        - "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384"
        # These are necessary for Win7 users with IE11
        - "TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA"
        - "TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA"
        sniStrict: true
        curvePreferences: []

The important part was curvePreferences: []. Without that, the grpc calls failed. Is there something that can be done on flyte's side to not require this, @davidmirror-ops?
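
For comparison with the Traefik values posted earlier in the thread, the fragment below isolates just the `curvePreferences` change. The "before" entries are copied from those earlier values. Reading the empty list as "no explicit curve restriction" is an assumption about Traefik's behavior, and the thread does not confirm whether this change alone is sufficient without the cipher suite changes.

```yaml
tlsOptions:
  default:
    # Before (gRPC handshakes failed):
    # curvePreferences:
    #   - CurveP521
    #   - CurveP384
    # After (gRPC calls worked): empty list, assumed to mean
    # that no explicit curve restriction is applied.
    curvePreferences: []
```
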

ldunkum commented 5 months ago

Something of note for anyone that comes across this in the future:

Traefik v3 removed regex matching from PathPrefix, therefore we switched to using PathRegexp.
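
As an illustration, a Traefik v3 version of the gRPC rule could look like the sketch below. It assumes the `traefik.io/v1alpha1` API group used by Traefik v3 and reuses the service names from the earlier comments. The regular expression is an assumption chosen to mirror the old `{service:.*}` matcher, not the configuration that was actually deployed.

```yaml
# Sketch only: Traefik v3 IngressRoute using PathRegexp for the flyteidl services.
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: flyte-binary-grpc-ingressroute
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      # PathRegexp takes a Go regular expression; PathPrefix is a plain prefix in v3.
      match: Host(`flyte.example.com`) && (PathRegexp(`^/flyteidl\..+`) || PathPrefix(`/grpc.health.v1.Health`))
      services:
        - kind: Service
          name: flyte-flyte-binary-grpc
          namespace: dev-flyte
          port: grpc
          scheme: h2c
```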