Closed amhuber closed 1 year ago
Hi @amhuber,
We had the same issue reported once a month ago for cf-deployment v29.0.0
, we've tried to understand what's going on and reproduce the issue, but the person reporting the problem didn't reply.
Can you please write some more info from which from which version of cf-deployment you are trying to upgrade to cf-deployment v31.2.0
? That would be really helpful so that we can find a way to reproduce the problem and we can decide if addition of the extended_key_usage
.
Have you check to see what's happening when you delete the server_auth
list item? I think it should work without it as this certificate is used by the cloud controller when it connect to Log Cache to get the application container metrics.
We are upgrading from v27.4.0 when we ran into the issue. When troubleshooting I was using curl to connect to the log-cache endpoint to test the certs and what I found is that is that on our environments still running v27.4.0 I can connect to log-cache without providing any client certs at all:
curl -v -k https://log-cache.service.cf.internal:8080
< HTTP/2 415
< content-type: application/grpc
< grpc-status: 3
< grpc-message: invalid gRPC request content-type ""
Trying the same test on v31.2.0 results in an "invalid certificate" error if I don't use a valid cert with client_auth, so the new update definitely fixed a vulnerability in log-cache.
I agree that removing server_auth from the cert should be fine since it appears that the cert is only used as a client cert (and technically the alternative names in the cert aren't needed either), but I didn't test it since I was just adding client_auth instead of changing server_auth to client_auth.
Just to add, here is the test I was using on v31.2.0:
cd /var/vcap/jobs/cloud_controller_ng/config/certs
curl -v -k --cacert logcache_tls_ca.crt --cert logcache_tls.crt --key logcache_tls.key https://log-cache.service.cf.internal:8080
On v31.2.0 before adding the client_auth usage, this resulted in an "invalid certificate" error. Once client_auth was added re-running the command just gave the expected gRPC error which confirmed that mTLS is working.
Hi @amhuber,
I've been checking our foundations in the past few days and I haven't found any problems. We haven't skipped cf-deployment version when updating. The person reporting the same issue in the CF logging and metrics Slack channel has replied today and said that he has seen the same error when setting up a foundation from scratch.
At the moment I'm deploying CF from scratch, to see if I can reproduce the problem. IMO it won't be a problem if we add the extended_key_usage
block. It will be good if we understand what causes the problem.
It doesn't really make much sense how it's not broken in your environments. The cert is being used for client auth for mTLS, and if log-cache is working correctly then the cert shouldn't be allowed without the client_auth usage. So either you found a new security vulnerability (the cert is allowed and it shouldn't be), or your cert has the client_auth usage somehow even though it's not in the manifest. Check your cert on the CC and see if it has the usage:
Wrong (generated from the settings in the main branch):
openssl x509 -in /var/vcap/jobs/cloud_controller_ng/config/certs/logcache_tls.crt -text | grep -A1 Extended
X509v3 Extended Key Usage:
TLS Web Server Authentication
Correct (after adding client_auth):
openssl x509 -in /var/vcap/jobs/cloud_controller_ng/config/certs/logcache_tls.crt -text | grep -A1 Extended
X509v3 Extended Key Usage:
TLS Web Client Authentication, TLS Web Server Authentication
Thanks for the clarification. I will create a PR tomorrow
@amhuber Where is the extended_key_usage
server_auth
coming from? I don't see it in the default manifest from cf-deployment. So when I look at the certificate in one of our foundations, I don't see any X509v3 Extended Key Usage
section and the curl
command succeeds.
Hi @philippthun, @amhuber added the extended_key_usage
node by himself (ops file or manual cf-deployment.yml change) to test if with it everything will work. I've also checked multiple foundations where have the latest cf-deployment release running where this flag is not there and didn't notice any problems.
Correct, I was just modifying our deployment manifest to add the client auth usage for testing. With the default manifest from the main branch that doesn't specify any usage properties BOSH will generate a cert with server_auth only, so I'm curious how you are getting a cert that doesn't have any usage parameter. I didn't even know that was possible.
For reference, here is the entire v3 section of my cert output:
X509v3 extensions:
X509v3 Key Usage: critical
Digital Signature, Key Encipherment
X509v3 Extended Key Usage:
TLS Web Client Authentication, TLS Web Server Authentication
X509v3 Basic Constraints: critical
CA:FALSE
X509v3 Subject Key Identifier:
8F:14:F4:80:3B:AB:12:5E:6B:94:54:B6:D1:40:73:B1:F8:04:38:BD
X509v3 Authority Key Identifier:
E3:54:C9:F5:90:19:51:72:45:D0:53:3B:03:23:42:B3:0C:72:0F:F7
X509v3 Subject Alternative Name:
DNS:api.my.domain.com, DNS:cloud-controller-ng.service.cf.internal
Is your cert missing any of the other sections, or just the "extended key usage" section? If it's not specified I'm not sure what the default values are. What I'm wondering is if the difference is that my test environment is just BOSH without Credhub? Maybe if the usage isn't specified then Credhub doesn't add one at all, and BOSH defaults to just server_auth? If that is the case, then I'd say that Credhub is doing it wrong since it's dangerous to have a cert that doesn't have an explicit purpose set.
In any case, every other cert in the manifest that is used as a client cert has the client_auth usage (or both client and server) explicitly listed, except for this one, and the fact this one is missing it is wrong and inconsistent even if you can't reproduce the problem in your environment for some reason.
... What I'm wondering is if the difference is that my test environment is just BOSH without Credhub? ...
Yes, that's the reason. I've created a cert with bosh interpolate
and it automatically gets theTLS Web Server Authentication
extension. Then I generated one with CredHub and it does not have this section.
So I asked ChatGPT and got the following answer:
In summary, while a certificate without the "X509v3 Extended Key Usage" section can still be used for TLS authentication in some scenarios, it's best to ensure that your certificates have the appropriate extensions, including "Extended Key Usage," to comply with security best practices and specific application requirements.
So we should definitely add this where missing as it is "more correct".
Mystery solved. Thank you very much both. I will prepare a PR
Hi @amhuber, @philippthun,
thank you very much for the patience and the tremendous help.
I've finally created a PR. Currently the CLA (Contributors License Agreement) check is failing as Aaron hasn't signed one. @amhuber could you please sign the CLA as individual contributor? It would be great to acknowledge your contribution to fixing this problem ;)
I did have the previous CLA signed but it took a very long time for my company's legal department to approve it, and I'd have to start over with the new CLA. Feel free to just remove my name, no credit required. :-)
What is this issue about?
When upgrading to the latest version of cf-deployment we were getting "stats server temporarily unavailable" errors from the CF CLI when running the "cf apps" command. We isolated this to an error from the CC connecting to log-cache to fetch stats. It appears that a recent change to log-cache (I'm assuming https://github.com/cloudfoundry/log-cache-release/pull/200) changed the mTLS behavior and the cc_logcache_tls certificate being used by the CC wasn't being accepted. After modify the certificate to add the client_auth usage it started working:
What version of cf-deployment are you using?
cf-deployment v31.2.0