alexandrevilain / temporal-operator

Temporal Kubernetes Operator
https://temporal-operator.pages.dev/
Apache License 2.0

Can't connect via client to frontend service with cert-manager mTLS certificate #722

Open andrewbelu opened 2 months ago

andrewbelu commented 2 months ago

Hey,

I've been trying to get mTLS up and running on my Temporal deployment. I have enabled mTLS on both internode communication and frontend communication. I have deployed the Temporal cluster like so (omitted extraneous data):

apiVersion: temporal.io/v1beta1
kind: TemporalCluster
metadata:
  name: temporal-cluster
  namespace: temporal
spec:
  mTLS:
    provider: cert-manager
    internode:
      enabled: true
    frontend:
      enabled: true
    certificatesDuration:
      clientCertificates: 48h0m0s
      frontendCertificate: 48h0m0s
      intermediateCAsCertificates: 128h0m0s
      internodeCertificate: 48h0m0s
      rootCACertificate: 256h0m0s
    refreshInterval: 1h0m0s
    renewBefore: 2h0m0s

I then created a TemporalClusterClient to get a certificate signed by the frontend intermediate CA in the test namespace:

apiVersion: temporal.io/v1beta1
kind: TemporalClusterClient
metadata:
  name: example-worker
  namespace: test
spec:
  clusterRef:
    name: temporal-cluster
    namespace: temporal

The secret is provisioned correctly into the test namespace. I then mount that secret into my pod (other data omitted for brevity):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-worker
  namespace: test
spec:
  template:
    spec:
      containers:
        - name: worker
          image: ...
          env:
            - name: TEMPORAL_ADDRESS
              value: temporal-cluster-frontend.temporal.svc.cluster.local:7233
          volumeMounts:
            - mountPath: "/var/temporal/certs"
              name: temporal-certs
              readOnly: true
      volumes:
        - name: temporal-certs
          secret:
            secretName: temporal-cluster-example-worker-mtls-certificate

I get a bad certificate error when attempting to connect with the certificate:

Traceback (most recent call last):
  File "/app/worker.py", line 83, in <module>
    loop.run_until_complete(main())
  File "/usr/local/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/app/worker.py", line 53, in main
    client = await Client.connect(
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/temporalio/client.py", line 164, in connect
    await temporalio.service.ServiceClient.connect(connect_config),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The relevant worker code:

    certs_directory = os.environ.get("TEMPORAL_CERTS_DIRECTORY", "/var/temporal/certs")
    with open(os.path.join(certs_directory, "tls.crt"), 'rb') as f:
        client_cert = f.read()
    with open(os.path.join(certs_directory, "tls.key"), 'rb') as f:
        client_key = f.read()
    with open(os.path.join(certs_directory, "ca.crt"), 'rb') as f:
        ca_cert = f.read()

    # Connect client
    client = await Client.connect(
        os.environ.get("TEMPORAL_ADDRESS", "localhost:7233"),
        namespace="default",
        tls=TLSConfig(
            client_cert=client_cert,
            client_private_key=client_key,
            server_root_ca_cert=ca_cert
        )
    )

I've also tried removing the server_root_ca_cert option and still get errors. However, with exactly the same setup, if I replace the cert generated by the TemporalClusterClient with the frontend-intermediate certificate secret (from the temporal namespace, just copied over), everything works just fine.

Running openssl s_client tells a similar story. With the TemporalClusterClient-generated certificate:

openssl s_client -connect temporal-cluster-frontend.temporal.svc.cluster.local:7233 -cert tls.crt -key tls.key -CAfile ca.crt
    Verify return code: 20 (unable to get local issuer certificate)

With the frontend intermediate:

openssl s_client -connect temporal-cluster-frontend.temporal.svc.cluster.local:7233 -cert tls.crt -key tls.key -CAfile ca.crt
    Verify return code: 0 (ok)
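Note that `openssl s_client` does not compare the server certificate against a hostname unless asked to (for example with `-verify_hostname`), so a return code of 0 only says the chain validated. A minimal sketch of the same probe using Python's standard `ssl` module, which checks the name by default, might look like the following. The function names and the file paths in the usage note are illustrative and not part of the operator or the Temporal SDK.

```python
import socket
import ssl
from typing import Optional


def build_client_context(
    ca_file: Optional[str] = None,
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a client-side TLS context that verifies both chain and hostname."""
    # create_default_context enables CERT_REQUIRED and check_hostname.
    # When ca_file is given, only that bundle is trusted.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    if cert_file:
        # Present the client certificate for mTLS.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


def probe(host: str, port: int, ctx: ssl.SSLContext, server_name: Optional[str] = None) -> dict:
    """Connect and return the peer certificate; raises on verification failure."""
    with socket.create_connection((host, port), timeout=5) as sock:
        # server_hostname is used for SNI and for the hostname check.
        with ctx.wrap_socket(sock, server_hostname=server_name or host) as tls:
            return tls.getpeercert()
```

Calling `probe("temporal-cluster-frontend.temporal.svc.cluster.local", 7233, build_client_context("ca.crt", "tls.crt", "tls.key"))` from inside the cluster should raise `ssl.SSLCertVerificationError` if the frontend certificate does not cover that name, and passing a different `server_name` shows which name it does cover.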

Any ideas? I am scratching my head trying to figure out what I might be doing wrong here.

alexandrevilain commented 2 months ago

Hi! Which version are you using?

andrewbelu commented 2 months ago

Hey, I am using version v0.18.0 of the operator. ghcr.io/alexandrevilain/temporal-operator:v0.18.0

alexandrevilain commented 1 month ago

Hi @andrewbelu!

This may be an issue with https://github.com/alexandrevilain/temporal-operator/pull/715. Could you please try with v0.17.0?

andrewbelu commented 1 month ago

@alexandrevilain Hello! Tried with v0.17 of the operator and same deal.

Here is the info of the certificate (omitted unnecessary details):

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            d3:b1:80:b7:89:71:af:d7:d8:9c:0b:66:82:77:3c:67
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = Frontend intermediate CA certificate
        Validity
            Not Before: Jun  4 18:29:09 2024 GMT
            Not After : Jun  6 18:29:09 2024 GMT
        Subject: CN = example-worker client certificate
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (4096 bit)
                Modulus:
                    ...
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Authority Key Identifier: 
                5E:9F:23:BA:83:22:89:07:79:D4:16:BA:0B:2D:75:35:45:23:C7:91
            X509v3 Subject Alternative Name: 
                DNS:example-worker.temporal-cluster.temporal.svc.cluster.local
    Signature Algorithm: sha256WithRSAEncryption
    Signature Value:
        ...

Perhaps it's the SAN? I notice that it's giving a different namespace for the worker than the one the worker pod is actually in, but I am unsure if this is intended or not.

X509v3 Subject Alternative Name: 
                DNS:example-worker.temporal-cluster.temporal.svc.cluster.local

I should add the original Python error (forgot to copy paste that):

RuntimeError: Failed client connect: Server connection error: tonic::transport::Error(Transport, hyper::Error(Connect, Custom { kind: InvalidData, error: InvalidCertificate(NotValidForName) }))
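
In this error, `InvalidCertificate(NotValidForName)` is the TLS library reporting that the certificate presented by the server does not cover the name the client expected. It concerns the frontend certificate rather than the SAN on the client certificate. A simplified sketch of that name check is below; it is an illustration only (real verifiers apply more rules), and the SAN values in the usage note are hypothetical.

```python
from typing import Iterable


def dns_name_matches(pattern: str, hostname: str) -> bool:
    """Compare one SAN DNS entry against a hostname, label by label."""
    pattern_labels = pattern.lower().rstrip(".").split(".")
    host_labels = hostname.lower().rstrip(".").split(".")
    if len(pattern_labels) != len(host_labels):
        return False
    if pattern_labels[0] == "*":
        # A wildcard stands in for exactly one leftmost label.
        return pattern_labels[1:] == host_labels[1:]
    return pattern_labels == host_labels


def valid_for_name(san_dns_names: Iterable[str], hostname: str) -> bool:
    """True if any SAN DNS entry covers the hostname the client dialed."""
    return any(dns_name_matches(name, hostname) for name in san_dns_names)
```

For example, a server certificate whose only SAN is the hypothetical `frontend.example.internal` would fail this check for `temporal-cluster-frontend.temporal.svc.cluster.local`, which is the kind of mismatch the error describes.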

alexandrevilain commented 1 month ago

Hi @andrewbelu!

Sorry for the late reply. I tried to reproduce your issue, but it works fine on my side.

Here are the steps I followed:

kubectl apply -f examples/cluster-mtls/00-namespace.yaml
kubectl apply -f examples/cluster-mtls/01-postgresql.yaml
kubectl apply -f examples/cluster-mtls/02-temporal-cluster.yaml
# waiting for the cluster to be up and running
kubectl apply -f examples/cluster-mtls/03-temporal-cluster-client.yaml
kubectl cert-manager inspect secret -n demo prod-my-worker-mtls-certificate # using cert-manager kubectl plugin
# exporting certificates
kubectl view-secret prod-my-worker-mtls-certificate -n demo tls.key > /tmp/tls.key 
kubectl view-secret prod-my-worker-mtls-certificate -n demo tls.crt > /tmp/tls.crt 
kubectl view-secret prod-my-worker-mtls-certificate -n demo ca.crt > /tmp/ca.crt
# exporting SERVER_NAME
export SERVER_NAME=$(kubectl get temporalclusterclient my-worker -o=template="{{.status.serverName}}")
# on another shell:
kubectl port-forward service/prod-frontend -n demo 7233:7233
# then same test:
openssl s_client -connect localhost:7233 -cert /tmp/tls.crt -key /tmp/tls.key -CAfile /tmp/ca.crt -servername $SERVER_NAME

Here is the result I get:

Connecting to ::1
CONNECTED(00000005)
depth=2 CN=Root CA certificate
verify return:1
depth=1 CN=Frontend intermediate CA certificate
verify return:1
depth=0 CN=Frontend Certificate
verify return:1
---
Certificate chain
 0 s:CN=Frontend Certificate
   i:CN=Frontend intermediate CA certificate
   a:PKEY: rsaEncryption, 4096 (bit); sigalg: RSA-SHA256
   v:NotBefore: Jun 13 14:20:27 2024 GMT; NotAfter: Jun 13 15:20:27 2024 GMT
 1 s:CN=Frontend intermediate CA certificate
   i:CN=Root CA certificate
   a:PKEY: rsaEncryption, 4096 (bit); sigalg: RSA-SHA256
   v:NotBefore: Jun 13 14:20:07 2024 GMT; NotAfter: Jun 13 15:50:07 2024 GMT
---
Server certificate
-----BEGIN CERTIFICATE-----
OMITTED
-----END CERTIFICATE-----
subject=CN=Frontend Certificate
issuer=CN=Frontend intermediate CA certificate
---
Acceptable client certificate CA names
CN=Root CA certificate
Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512:RSA+SHA1:ECDSA+SHA1
Shared Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512
Peer signing digest: SHA256
Peer signature type: RSA-PSS
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 3652 bytes and written 2352 bytes
Verification: OK
---
New, TLSv1.3, Cipher is TLS_AES_128_GCM_SHA256
Server public key is 4096 bit
This TLS version forbids renegotiation.
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)

Is there something I'm missing to reproduce your issue?
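
One difference between the steps above and the worker code earlier in the thread is the explicit `-servername $SERVER_NAME`. A hedged sketch of the equivalent in the Python SDK follows, assuming `TLSConfig`'s `domain` field is what overrides the expected server name. `TEMPORAL_TLS_SERVER_NAME` is a hypothetical environment variable meant to carry the TemporalClusterClient's `.status.serverName`, and `load_tls_kwargs` is an illustrative helper. Whether this resolves the reported error is not confirmed in the thread.

```python
import os
from pathlib import Path
from typing import Optional


def load_tls_kwargs(certs_dir: str, server_name: Optional[str] = None) -> dict:
    """Read the mounted secret and build keyword arguments for TLSConfig."""
    base = Path(certs_dir)
    kwargs = {
        "client_cert": (base / "tls.crt").read_bytes(),
        "client_private_key": (base / "tls.key").read_bytes(),
        "server_root_ca_cert": (base / "ca.crt").read_bytes(),
    }
    if server_name:
        # Assumption: `domain` sets the name the server certificate is
        # verified against, instead of the host part of the address.
        kwargs["domain"] = server_name
    return kwargs


async def connect():
    # Imported lazily so the helper above works without the SDK installed.
    from temporalio.client import Client
    from temporalio.service import TLSConfig

    tls_kwargs = load_tls_kwargs(
        os.environ.get("TEMPORAL_CERTS_DIRECTORY", "/var/temporal/certs"),
        # Hypothetical variable, e.g. populated from .status.serverName.
        os.environ.get("TEMPORAL_TLS_SERVER_NAME"),
    )
    return await Client.connect(
        os.environ.get("TEMPORAL_ADDRESS", "localhost:7233"),
        namespace="default",
        tls=TLSConfig(**tls_kwargs),
    )
```

Leaving `TEMPORAL_TLS_SERVER_NAME` unset keeps the behaviour of the original worker code, so the two configurations can be compared directly.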