cloudfoundry / cf-deployment

The canonical open source deployment manifest for Cloud Foundry
Apache License 2.0
294 stars 306 forks source link

Missing client_auth property on the cc_logcache_tls certificate #1106

Closed amhuber closed 1 year ago

amhuber commented 1 year ago

What is this issue about?

When upgrading to the latest version of cf-deployment we were getting "stats server temporarily unavailable" errors from the CF CLI when running the "cf apps" command. We isolated this to an error from the CC connecting to log-cache to fetch stats. It appears that a recent change to log-cache (I'm assuming https://github.com/cloudfoundry/log-cache-release/pull/200) changed the mTLS behavior and the cc_logcache_tls certificate being used by the CC wasn't being accepted. After modify the certificate to add the client_auth usage it started working:

- name: cc_logcache_tls
  type: certificate
  update_mode: converge
  options:
    ca: log_cache_ca
    common_name: "api.((system_domain))"
    alternative_names:
    - "api.((system_domain))"
    - cloud-controller-ng.service.cf.internal
    extended_key_usage:
    - client_auth
    - server_auth

What version of cf-deployment are you using?

cf-deployment v31.2.0

chombium commented 1 year ago

Hi @amhuber,

We had the same issue reported once a month ago for cf-deployment v29.0.0, we've tried to understand what's going on and reproduce the issue, but the person reporting the problem didn't reply.

Can you please write some more info from which from which version of cf-deployment you are trying to upgrade to cf-deployment v31.2.0? That would be really helpful so that we can find a way to reproduce the problem and we can decide if addition of the extended_key_usage.

Have you check to see what's happening when you delete the server_auth list item? I think it should work without it as this certificate is used by the cloud controller when it connect to Log Cache to get the application container metrics.

amhuber commented 1 year ago

We are upgrading from v27.4.0 when we ran into the issue. When troubleshooting I was using curl to connect to the log-cache endpoint to test the certs and what I found is that is that on our environments still running v27.4.0 I can connect to log-cache without providing any client certs at all:

curl -v -k https://log-cache.service.cf.internal:8080
< HTTP/2 415
< content-type: application/grpc
< grpc-status: 3
< grpc-message: invalid gRPC request content-type ""

Trying the same test on v31.2.0 results in an "invalid certificate" error if I don't use a valid cert with client_auth, so the new update definitely fixed a vulnerability in log-cache.

I agree that removing server_auth from the cert should be fine since it appears that the cert is only used as a client cert (and technically the alternative names in the cert aren't needed either), but I didn't test it since I was just adding client_auth instead of changing server_auth to client_auth.

amhuber commented 1 year ago

Just to add, here is the test I was using on v31.2.0:

cd /var/vcap/jobs/cloud_controller_ng/config/certs
curl -v -k --cacert logcache_tls_ca.crt --cert logcache_tls.crt --key logcache_tls.key https://log-cache.service.cf.internal:8080

On v31.2.0 before adding the client_auth usage, this resulted in an "invalid certificate" error. Once client_auth was added re-running the command just gave the expected gRPC error which confirmed that mTLS is working.

chombium commented 1 year ago

Hi @amhuber,

I've been checking our foundations in the past few days and I haven't found any problems. We haven't skipped cf-deployment version when updating. The person reporting the same issue in the CF logging and metrics Slack channel has replied today and said that he has seen the same error when setting up a foundation from scratch.

At the moment I'm deploying CF from scratch, to see if I can reproduce the problem. IMO it won't be a problem if we add the extended_key_usage block. It will be good if we understand what causes the problem.

amhuber commented 1 year ago

It doesn't really make much sense how it's not broken in your environments. The cert is being used for client auth for mTLS, and if log-cache is working correctly then the cert shouldn't be allowed without the client_auth usage. So either you found a new security vulnerability (the cert is allowed and it shouldn't be), or your cert has the client_auth usage somehow even though it's not in the manifest. Check your cert on the CC and see if it has the usage:

Wrong (generated from the settings in the main branch):

openssl x509 -in /var/vcap/jobs/cloud_controller_ng/config/certs/logcache_tls.crt -text | grep -A1 Extended
        X509v3 Extended Key Usage:
            TLS Web Server Authentication

Correct (after adding client_auth):

openssl x509 -in /var/vcap/jobs/cloud_controller_ng/config/certs/logcache_tls.crt -text | grep -A1 Extended
        X509v3 Extended Key Usage:
            TLS Web Client Authentication, TLS Web Server Authentication
chombium commented 1 year ago

Thanks for the clarification. I will create a PR tomorrow

philippthun commented 1 year ago

@amhuber Where is the extended_key_usage server_auth coming from? I don't see it in the default manifest from cf-deployment. So when I look at the certificate in one of our foundations, I don't see any X509v3 Extended Key Usage section and the curl command succeeds.

chombium commented 1 year ago

Hi @philippthun, @amhuber added the extended_key_usage node by himself (ops file or manual cf-deployment.yml change) to test if with it everything will work. I've also checked multiple foundations where have the latest cf-deployment release running where this flag is not there and didn't notice any problems.

amhuber commented 1 year ago

Correct, I was just modifying our deployment manifest to add the client auth usage for testing. With the default manifest from the main branch that doesn't specify any usage properties BOSH will generate a cert with server_auth only, so I'm curious how you are getting a cert that doesn't have any usage parameter. I didn't even know that was possible.

For reference, here is the entire v3 section of my cert output:

    X509v3 extensions:
        X509v3 Key Usage: critical
            Digital Signature, Key Encipherment
        X509v3 Extended Key Usage:
            TLS Web Client Authentication, TLS Web Server Authentication
        X509v3 Basic Constraints: critical
            CA:FALSE
        X509v3 Subject Key Identifier:
            8F:14:F4:80:3B:AB:12:5E:6B:94:54:B6:D1:40:73:B1:F8:04:38:BD
        X509v3 Authority Key Identifier:
            E3:54:C9:F5:90:19:51:72:45:D0:53:3B:03:23:42:B3:0C:72:0F:F7
        X509v3 Subject Alternative Name:
            DNS:api.my.domain.com, DNS:cloud-controller-ng.service.cf.internal

Is your cert missing any of the other sections, or just the "extended key usage" section? If it's not specified I'm not sure what the default values are. What I'm wondering is if the difference is that my test environment is just BOSH without Credhub? Maybe if the usage isn't specified then Credhub doesn't add one at all, and BOSH defaults to just server_auth? If that is the case, then I'd say that Credhub is doing it wrong since it's dangerous to have a cert that doesn't have an explicit purpose set.

In any case, every other cert in the manifest that is used as a client cert has the client_auth usage (or both client and server) explicitly listed, except for this one, and the fact this one is missing it is wrong and inconsistent even if you can't reproduce the problem in your environment for some reason.

philippthun commented 1 year ago

... What I'm wondering is if the difference is that my test environment is just BOSH without Credhub? ...

Yes, that's the reason. I've created a cert with bosh interpolate and it automatically gets theTLS Web Server Authentication extension. Then I generated one with CredHub and it does not have this section.

So I asked ChatGPT and got the following answer:

In summary, while a certificate without the "X509v3 Extended Key Usage" section can still be used for TLS authentication in some scenarios, it's best to ensure that your certificates have the appropriate extensions, including "Extended Key Usage," to comply with security best practices and specific application requirements.

So we should definitely add this where missing as it is "more correct".

chombium commented 1 year ago

Mystery solved. Thank you very much both. I will prepare a PR

chombium commented 1 year ago

Hi @amhuber, @philippthun,

thank you very much for the patience and the tremendous help.

I've finally created a PR. Currently the CLA (Contributors License Agreement) check is failing as Aaron hasn't signed one. @amhuber could you please sign the CLA as individual contributor? It would be great to acknowledge your contribution to fixing this problem ;)

amhuber commented 1 year ago

I did have the previous CLA signed but it took a very long time for my company's legal department to approve it, and I'd have to start over with the new CLA. Feel free to just remove my name, no credit required. :-)