alexandrevilain / temporal-operator

Temporal Kubernetes Operator
https://temporal-operator.pages.dev/
Apache License 2.0

Can't connect via client to frontend service with cert-manager mTLS certificate #722

Open andrewbelu opened 6 months ago

andrewbelu commented 6 months ago

Hey,

I've been trying to get mTLS up and running on my Temporal deployment. I have enabled mTLS on both internode communication and frontend communication. I have deployed the Temporal cluster like so (omitted extraneous data):

apiVersion: temporal.io/v1beta1
kind: TemporalCluster
metadata:
  name: temporal-cluster
  namespace: temporal
spec:
  mTLS:
    provider: cert-manager
    internode:
      enabled: true
    frontend:
      enabled: true
    certificatesDuration:
      clientCertificates: 48h0m0s
      frontendCertificate: 48h0m0s
      intermediateCAsCertificates: 128h0m0s
      internodeCertificate: 48h0m0s
      rootCACertificate: 256h0m0s
    refreshInterval: 1h0m0s
    renewBefore: 2h0m0s

I then created a TemporalClusterClient to get a certificate signed by the frontend intermediate CA in the test namespace:

apiVersion: temporal.io/v1beta1
kind: TemporalClusterClient
metadata:
  name: example-worker
  namespace: test
spec:
  clusterRef:
    name: temporal-cluster
    namespace: temporal

The secret is provisioned correctly into the test namespace. I then mount that secret into my pod (other data omitted for brevity):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-worker
  namespace: test
spec:
  template:
    spec:
      containers:
        - name: worker
          image: ...
          env:
            - name: TEMPORAL_ADDRESS
              value: temporal-cluster-frontend.temporal.svc.cluster.local:7233
          volumeMounts:
            - mountPath: "/var/temporal/certs"
              name: temporal-certs
              readOnly: true
      volumes:
        - name: temporal-certs
          secret:
            secretName: temporal-cluster-example-worker-mtls-certificate

I get a bad certificate error when attempting to connect with the certificate:

Traceback (most recent call last):
  File "/app/worker.py", line 83, in <module>
    loop.run_until_complete(main())
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/app/worker.py", line 53, in main
    client = await Client.connect(
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/temporalio/client.py", line 164, in connect
    await temporalio.service.ServiceClient.connect(connect_config),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The relevant worker code:

    certs_directory = os.environ.get("TEMPORAL_CERTS_DIRECTORY", "/var/temporal/certs")
    with open(os.path.join(certs_directory, "tls.crt"), 'rb') as f:
        client_cert = f.read()
    with open(os.path.join(certs_directory, "tls.key"), 'rb') as f:
        client_key = f.read()
    with open(os.path.join(certs_directory, "ca.crt"), 'rb') as f:
        ca_cert = f.read()

    # Connect client
    client = await Client.connect(
        os.environ.get("TEMPORAL_ADDRESS", "localhost:7233"),
        namespace="default",
        tls=TLSConfig(
            client_cert=client_cert,
            client_private_key=client_key,
            server_root_ca_cert=ca_cert
        )
    )

I've also tried removing the server_root_ca_cert option and still get errors. However, with exactly the same setup, if I replace the cert generated by the TemporalClusterClient with the frontend-intermediate certificate secret (copied over from the temporal namespace), everything works just fine.

Running openssl s_client tells a similar story. With the TemporalClusterClient-generated certificate:

openssl s_client -connect temporal-cluster-frontend.temporal.svc.cluster.local:7233 -cert tls.crt -key tls.key -CAfile ca.crt
    Verify return code: 20 (unable to get local issuer certificate)

With the frontend intermediate:

openssl s_client -connect temporal-cluster-frontend.temporal.svc.cluster.local:7233 -cert tls.crt -key tls.key -CAfile ca.crt
    Verify return code: 0 (ok)

Any ideas? I am scratching my head trying to figure out what I might be doing wrong here.

alexandrevilain commented 6 months ago

Hi! Which version are you using?

andrewbelu commented 6 months ago

Hey, I am using version v0.18.0 of the operator. ghcr.io/alexandrevilain/temporal-operator:v0.18.0

alexandrevilain commented 6 months ago

Hi @andrewbelu !

This may be an issue with https://github.com/alexandrevilain/temporal-operator/pull/715. Could you please try with v0.17.0?

andrewbelu commented 5 months ago

@alexandrevilain Hello! Tried with v0.17 of the operator and same deal.

Here is the info of the certificate (omitted unnecessary details):

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            d3:b1:80:b7:89:71:af:d7:d8:9c:0b:66:82:77:3c:67
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = Frontend intermediate CA certificate
        Validity
            Not Before: Jun  4 18:29:09 2024 GMT
            Not After : Jun  6 18:29:09 2024 GMT
        Subject: CN = example-worker client certificate
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (4096 bit)
                Modulus:
                    ...
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Authority Key Identifier: 
                5E:9F:23:BA:83:22:89:07:79:D4:16:BA:0B:2D:75:35:45:23:C7:91
            X509v3 Subject Alternative Name: 
                DNS:example-worker.temporal-cluster.temporal.svc.cluster.local
    Signature Algorithm: sha256WithRSAEncryption
    Signature Value:
        ...

Perhaps it's the SAN? I notice that it's giving a different namespace for the worker than the one the worker pod is actually in, but I am unsure if this is intended or not.

X509v3 Subject Alternative Name: 
                DNS:example-worker.temporal-cluster.temporal.svc.cluster.local

I should add the original Python error (forgot to copy paste that):

RuntimeError: Failed client connect: Server connection error: tonic::transport::Error(Transport, hyper::Error(Connect, Custom { kind: InvalidData, error: InvalidCertificate(NotValidForName) }))

alexandrevilain commented 5 months ago

Hi @andrewbelu !

Sorry for the late reply, I'm trying to reproduce your issue, but it works well on my side.

Here are the steps I followed:

kubectl apply -f examples/cluster-mtls/00-namespace.yaml
kubectl apply -f examples/cluster-mtls/01-postgresql.yaml
kubectl apply -f examples/cluster-mtls/02-temporal-cluster.yaml
# waiting for the cluster to be up and running
kubectl apply -f examples/cluster-mtls/03-temporal-cluster-client.yaml
kubectl cert-manager inspect secret -n demo prod-my-worker-mtls-certificate # using cert-manager kubectl plugin
# exporting certificates
kubectl view-secret prod-my-worker-mtls-certificate -n demo tls.key > /tmp/tls.key 
kubectl view-secret prod-my-worker-mtls-certificate -n demo tls.crt > /tmp/tls.crt 
kubectl view-secret prod-my-worker-mtls-certificate -n demo ca.crt > /tmp/ca.crt
# exporting SERVER_NAME
export SERVER_NAME=$(kubectl get temporalclusterclient my-worker -o=template="{{.status.serverName}}")
# on another shell:
kubectl port-forward service/prod-frontend -n demo 7233:7233
# then same test:
openssl s_client -connect localhost:7233 -cert /tmp/tls.crt -key /tmp/tls.key -CAfile /tmp/ca.crt -servername $SERVER_NAME

Here is the result I get:

Connecting to ::1
CONNECTED(00000005)
depth=2 CN=Root CA certificate
verify return:1
depth=1 CN=Frontend intermediate CA certificate
verify return:1
depth=0 CN=Frontend Certificate
verify return:1
---
Certificate chain
 0 s:CN=Frontend Certificate
   i:CN=Frontend intermediate CA certificate
   a:PKEY: rsaEncryption, 4096 (bit); sigalg: RSA-SHA256
   v:NotBefore: Jun 13 14:20:27 2024 GMT; NotAfter: Jun 13 15:20:27 2024 GMT
 1 s:CN=Frontend intermediate CA certificate
   i:CN=Root CA certificate
   a:PKEY: rsaEncryption, 4096 (bit); sigalg: RSA-SHA256
   v:NotBefore: Jun 13 14:20:07 2024 GMT; NotAfter: Jun 13 15:50:07 2024 GMT
---
Server certificate
-----BEGIN CERTIFICATE-----
OMITTED
-----END CERTIFICATE-----
subject=CN=Frontend Certificate
issuer=CN=Frontend intermediate CA certificate
---
Acceptable client certificate CA names
CN=Root CA certificate
Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512:RSA+SHA1:ECDSA+SHA1
Shared Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512
Peer signing digest: SHA256
Peer signature type: RSA-PSS
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 3652 bytes and written 2352 bytes
Verification: OK
---
New, TLSv1.3, Cipher is TLS_AES_128_GCM_SHA256
Server public key is 4096 bit
This TLS version forbids renegotiation.
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)

Is there something I'm missing to reproduce your issue?
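
For anyone who wants to run the same check from inside a pod without openssl, the steps above can be roughly approximated with Python's standard library. This is only a sketch: the function names `build_mtls_context` and `probe` are made up for illustration, and it assumes the three files exported from the secret are readable on disk. The key point it mirrors is the `-servername` flag: the TCP connection goes to one address while the certificate is verified against a different name.

```python
import socket
import ssl


def build_mtls_context(cert_file, key_file, ca_file):
    """Build a client context that trusts only ca_file and presents a client cert."""
    # Passing cafile means the system trust store is not loaded.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


def probe(host, port, server_name, ctx, timeout=5.0):
    """Connect to host:port, but verify the server certificate against server_name."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # server_hostname sets SNI and is the name checked against the SANs.
        with ctx.wrap_socket(sock, server_hostname=server_name) as tls:
            sans = tls.getpeercert().get("subjectAltName", ())
            return tls.version(), sans
```

Calling `probe("localhost", 7233, server_name, ctx)` with `server_name` set to the value of `.status.serverName` would correspond to the port-forward test above. If the name does not match the certificate, `wrap_socket` raises `ssl.SSLCertVerificationError`, which is the standard-library counterpart of the `NotValidForName` error reported earlier.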

bmorton commented 3 days ago

I am troubleshooting a similar issue and figured I'd weigh in here so we can make progress on #746. @andrewbelu, in your error I see error: InvalidCertificate(NotValidForName), which makes me think it's the server name. Temporal's Go example for this also doesn't include the server name, so I was running into the same issue. Once I added it, things started working for me:

temporalHostPort := os.Getenv("TEMPORAL_ADDRESS")
temporalNamespace := os.Getenv("TEMPORAL_NAMESPACE")
temporalTLSCert := os.Getenv("TEMPORAL_TLS_CERT")
temporalTLSKey := os.Getenv("TEMPORAL_TLS_KEY")
temporalTLSCACert := os.Getenv("TEMPORAL_TLS_CA_CERT")
temporalTLSServerName := os.Getenv("TEMPORAL_TLS_SERVER_NAME")

serverCAPool := x509.NewCertPool()
b, err := os.ReadFile(temporalTLSCACert)
if err != nil {
    log.Fatalln("Unable to read server CA certificate", err)
}
if !serverCAPool.AppendCertsFromPEM(b) {
    log.Fatalln("Unable to append server CA certificate to pool")
}

clientOptions := client.Options{
    HostPort:          temporalHostPort,
    ConnectionOptions: client.ConnectionOptions{
        TLS: &tls.Config{
            GetClientCertificate: func(info *tls.CertificateRequestInfo) (*tls.Certificate, error) {
                cert, err := tls.LoadX509KeyPair(temporalTLSCert, temporalTLSKey)
                if err != nil {
                    return nil, err
                }
                return &cert, nil
            },
            RootCAs:    serverCAPool,
            ServerName: temporalTLSServerName,
        },
    },
    Namespace:         temporalNamespace,
}

c, err := client.Dial(clientOptions)
if err != nil {
    log.Fatalln("Unable to create client", err)
}
defer c.Close()

Looking at the Python docs, it looks like there's a domain property that could be set. Does setting that fix your issue?
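
In case it helps, here is a minimal sketch of what that might look like in the Python worker. It assumes the server name is passed in through a `TEMPORAL_TLS_SERVER_NAME` environment variable (same name as in my Go snippet, not something the operator sets for you) and that its value comes from `.status.serverName` on the TemporalClusterClient. The helper name `load_tls_kwargs` is just for illustration.

```python
import os
from pathlib import Path


def load_tls_kwargs(certs_dir, server_name=None):
    """Read the mounted secret and build keyword arguments for TLSConfig."""
    base = Path(certs_dir)
    kwargs = {
        "client_cert": (base / "tls.crt").read_bytes(),
        "client_private_key": (base / "tls.key").read_bytes(),
        "server_root_ca_cert": (base / "ca.crt").read_bytes(),
    }
    if server_name:
        # `domain` overrides the name used to verify the server certificate.
        kwargs["domain"] = server_name
    return kwargs


async def connect():
    # Imported lazily so the helper above can be used without the SDK installed.
    from temporalio.client import Client
    from temporalio.service import TLSConfig

    tls = TLSConfig(**load_tls_kwargs(
        os.environ.get("TEMPORAL_CERTS_DIRECTORY", "/var/temporal/certs"),
        # Assumed env var; populate it from .status.serverName.
        os.environ.get("TEMPORAL_TLS_SERVER_NAME"),
    ))
    return await Client.connect(
        os.environ.get("TEMPORAL_ADDRESS", "localhost:7233"),
        namespace="default",
        tls=tls,
    )
```

When the variable is unset, `domain` is left out and the SDK falls back to verifying against the host in `TEMPORAL_ADDRESS`, which is the behavior that produced the original error.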

My issue is that once certificates expire, things don't seem to refresh, which I'll follow up on in the relevant issue.