Closed gberche-orange closed 11 months ago
@gberche-orange great job on the analysis! @bgandon has volunteered during the Foundational Infrastructure working group meeting to make PRs to change all occurrences of USA to US.
Great, thanks @rkoster for the update and the reminder about the working group meeting. We'll try to join the next time we submit issues/PRs.
Thinking it through, my initial proposal to introduce an opt-in property is unnecessary. The USA -> US country code change should not have negative side effects for bosh deployments that currently support zero-downtime certificate rotation using certs generated by bosh interpolate or bosh config servers: the update procedure already supports having distinct certificates (N and N+1) with distinct Subjects (with the USA or US country code).
Thanks @bgandon for your proposal to submit PRs to change the country code from USA to US in bosh, much appreciated!
As a first step toward a fix, I’ve contributed the cloudfoundry/config-server#17 PR. This change will then have to be pulled in by the Bosh CLI.
CredHub server and CLI don’t seem to be affected.
With Bosh Director though, I’m concerned about the NATS client certs. Indeed, it looks like these certs are generated with the 3-letter USA country code, and trusted with that exact country code.
See https://github.com/cloudfoundry/bosh/blob/main/jobs/nats/templates/nats.cfg.erb#L10-L32.
Fixing this might require two distinct phases, with a proper transition that trusts both types of generated certificates, with either the US or the USA country code. Don’t hesitate to bring more context here, @rkoster or @jpalermo, for a better understanding of the NATS client cert generation and trust mechanisms.
Following up on this, if we re-generate a NATS CA with a fixed subject (using a proper X.520 country code, i.e. a 2-letter ISO 3166 code), the problem with the new NATS CA re-generated by CredHub is the same as when the NATS CA expires and needs to be rotated.
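To make the 2-letter constraint concrete, here is a minimal Ruby sketch that checks the shape of the countryName in a subject. The helper name `valid_country_code?` is hypothetical and not part of BOSH, and it only checks for two uppercase letters, not membership in the ISO 3166 list.

```ruby
require "openssl"

# Hypothetical helper (not part of BOSH): returns true when the subject has no
# countryName (C) attribute, or when every C value is two uppercase letters.
# It checks the shape only, not membership in the ISO 3166 list.
def valid_country_code?(subject)
  name = OpenSSL::X509::Name.parse(subject)
  countries = name.to_a.select { |oid, _value, _type| oid == "C" }
  countries.all? { |_oid, value, _type| value.match?(/\A[A-Z]{2}\z/) }
end

puts valid_country_code?("/C=US/O=Cloud Foundry/CN=nats-ca")
puts valid_country_code?("/C=USA/O=Cloud Foundry/CN=nats-ca")
puts valid_country_code?("/O=Cloud Foundry/CN=nats-ca")
```

A subject without any C attribute passes the check, which matches the idea discussed in this thread that the countryName RDN may simply be omitted.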
For the time being, it’s still unclear to me how the NATS CA can safely be rotated, so I need more input to understand which scenario is already supported and how.
Indeed, I imagine the Director has to “see” that the NATS CA has changed, re-generate all NATS client certificates for Agents, then push these through the live “update VM settings” mechanism, which unfortunately relies on NATS. This would require the NATS server to temporarily trust both the old and the new NATS CA (because Agents are still using old client certs to connect to NATS and receive updated VM settings). I’m not sure whether there is another safe way, so I need to investigate further.
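The "trust both the old and the new CA" idea can be sketched in Ruby with throwaway certificates. This is only an illustration of dual trust under stated assumptions: the helpers `make_ca`, `issue_client` and `trust_store` are hypothetical, the subjects are examples, and nothing here reflects how the NATS server itself loads its CA bundle.

```ruby
require "openssl"

# Builds a throwaway self-signed CA (illustration only, short validity).
def make_ca(subject)
  key = OpenSSL::PKey::RSA.new(2048)
  cert = OpenSSL::X509::Certificate.new
  cert.version = 2 # X.509 v3
  cert.serial = 1
  cert.subject = OpenSSL::X509::Name.parse(subject)
  cert.issuer = cert.subject # self-signed
  cert.public_key = key.public_key
  cert.not_before = Time.now - 60
  cert.not_after = Time.now + 3600
  ef = OpenSSL::X509::ExtensionFactory.new
  ef.subject_certificate = cert
  ef.issuer_certificate = cert
  cert.add_extension(ef.create_extension("basicConstraints", "CA:TRUE", true))
  cert.sign(key, OpenSSL::Digest.new("SHA256"))
  [cert, key]
end

# Issues a client certificate signed by the given CA.
def issue_client(ca_cert, ca_key, subject)
  key = OpenSSL::PKey::RSA.new(2048)
  cert = OpenSSL::X509::Certificate.new
  cert.version = 2
  cert.serial = 2
  cert.subject = OpenSSL::X509::Name.parse(subject)
  cert.issuer = ca_cert.subject
  cert.public_key = key.public_key
  cert.not_before = Time.now - 60
  cert.not_after = Time.now + 3600
  cert.sign(ca_key, OpenSSL::Digest.new("SHA256"))
  cert
end

# Builds a trust store containing every CA passed in.
def trust_store(*ca_certs)
  store = OpenSSL::X509::Store.new
  ca_certs.each { |ca| store.add_cert(ca) }
  store
end

old_ca, old_key = make_ca("/C=USA/O=Cloud Foundry/CN=old-nats-ca")
new_ca, new_key = make_ca("/C=US/O=Cloud Foundry/CN=new-nats-ca")
old_client = issue_client(old_ca, old_key, "/C=USA/O=Cloud Foundry/CN=a.agent.bosh-internal")
new_client = issue_client(new_ca, new_key, "/C=US/O=Cloud Foundry/CN=a.agent.bosh-internal")

transition = trust_store(old_ca, new_ca)
puts transition.verify(old_client) # old agents can still connect
puts transition.verify(new_client) # re-created agents connect too
puts trust_store(new_ca).verify(old_client) # once the old CA is dropped
```

During the transition both kinds of client certs verify; once only the new CA is trusted, the old client certs are rejected, which is why the old CA can only be dropped after every Agent has received its new certificate.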
Then, there is also the question of eliminating the countryName (C) “RelativeDistinguishedName” (or “RDN”, as the RFC says) from the certificate subjects. So far, I see no reason why the countryName RDN would be mandatory there. RFC 2253, pointed out this afternoon, doesn’t state that it is mandatory, and neither does RFC 2459, which was mentioned above and might be more relevant here. Correct me if I’m wrong. I need to run a test on a Bosh Director, switching it to generate NATS certs without the countryName RDN, and see what happens.
Finally, @rkoster mentioned today in the CFF Foundational Infrastructure meeting the TLS Authentication chapter of the NATS documentation, which links to RFC 2253. It’s interesting to see in section 2.2, and later in the examples, that RDNs can natively be multi-valued, with the values separated by a + sign.
This means that we could possibly trust certificates with Subject: C=US+C=USA, O=Cloud Foundry, CN=bla-bla-bla, or try something in that direction.
On the subject of multi-valued RDNs like C=US+C=USA: this doesn’t mean “US or USA” but “US and USA”. So that would not help us.
More interesting, the NATS client certs don’t need a countryName (C) RDN (formal verification below). But the challenge is to synchronize the Subject of the NATS certificates (generated by the Director) with the usernames put in the NATS config (by the separate bosh-nats-sync).
If we suddenly change the NATS client certs subject in the Director’s code, then we need:
sv will restart the Agent until it successfully connects.)
In order to synchronise the Director with the external NATS config generator, the NATS usernames can no longer be a shared convention between these separate components, and must be transmitted by the Director to the NATS config generator. This may involve storing the NATS username in the database, which involves tough refactoring for the small benefit of aligning the countryName (C) RDN with the RFC.
Instead, the NATS client certs could be re-generated as part of a normal NATS CA rotation process, as documented in “Rotating NATS Certificate Authorities”.
My advice would be to modify the generate_nats_client_certificate(common_name) method (in the NatsClientCertGenerator class) so that it grabs the countryName (C) RDN from the NATS CA and uses it in the generated NATS client certificates.
Then the agent_user(agent_id, cn) method (in the NATSSync::NatsAuthConfig class) would need to know the countryName (C) RDN from the CA subject in order to put the correct user names in the generated NATS config. Such transmission of information is already done for the Director and Health Monitor NATS usernames. Adding a third one for the NATS CA countryName is affordable.
This way, the NATS client cert re-generation (with the correct countryName (C) RDN) would be synchronized with the NATS CA change, and no new trigger in the Director would be necessary. Operators would do the operation at their own pace, following the usual NATS CA rotation process when the time has come for them to do so.
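A minimal Ruby sketch of that proposal, under stated assumptions: the helpers `country_of`, `client_subject` and `nats_user` are hypothetical names for illustration and are not the actual Director or bosh-nats-sync code. The point is only that both the certificate subject and the NATS user name are derived from the same CA countryName, so they cannot drift apart.

```ruby
require "openssl"

# Hypothetical helper: returns the countryName (C) value of a subject, or nil.
def country_of(name)
  entry = name.to_a.find { |oid, _value, _type| oid == "C" }
  entry && entry[1]
end

# Builds the client certificate subject, reusing the CA's C value when present.
def client_subject(ca_subject, common_name)
  parts = []
  country = country_of(ca_subject)
  parts << ["C", country] if country
  parts << ["O", "Cloud Foundry"]
  parts << ["CN", common_name]
  OpenSSL::X509::Name.new(parts)
end

# Builds the matching NATS user name in the "C=..., O=..., CN=..." form.
# Here common_name is the full CN, including any ".agent.bosh-internal" suffix.
def nats_user(ca_subject, common_name)
  country = country_of(ca_subject)
  prefix = country ? "C=#{country}, " : ""
  "#{prefix}O=Cloud Foundry, CN=#{common_name}"
end

ca_subject = OpenSSL::X509::Name.parse("/C=US/O=Cloud Foundry/CN=nats-ca")
puts client_subject(ca_subject, "abc.agent.bosh-internal")
puts nats_user(ca_subject, "abc.agent.bosh-internal")
```

With a CA subject that has no C attribute, both helpers simply leave the country out, so the same code path would cover the "no countryName at all" option.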
Code ref.: short-lived NATS credentials: https://github.com/cloudfoundry/bosh/commit/dec31de320fcd29a574db8685f6abf697138f788
On a running Director VM, patch both “director” and “nats-sync” Gems.
# cd /var/vcap/packages/director
# patch -p1 # paste the 1st patch below then type Control-D (possibly two times)
--- director/gem_home/ruby/3.2.0/gems/bosh-director-0.0.0/lib/bosh/director/nats_client_cert_generator.rb 2023-11-07 11:53:28.089549055 +0000
+++ director/gem_home/ruby/3.2.0/gems/bosh-director-0.0.0/lib/bosh/director/nats_client_cert_generator.new.rb 2023-11-07 11:52:24.473567658 +0000
@@ -35,7 +35,7 @@
cert.serial = SecureRandom.hex(16).to_i(16)
- cert.subject = OpenSSL::X509::Name.parse "/C=USA/O=Cloud Foundry/CN=#{common_name}"
+ cert.subject = OpenSSL::X509::Name.parse "/O=Cloud Foundry/CN=#{common_name}"
cert.issuer = @root_ca.subject # root CA is the issuer
cert.public_key = key.public_key
cert.not_before = Time.now
^D
# cd /var/vcap/packages/nats
# patch -p1 # paste the 2nd patch below then type Control-D
--- nats/gem_home/ruby/3.2.0/gems/bosh-nats-sync-0.0.0/lib/nats_sync/nats_auth_config.rb 2023-10-28 23:13:14.000000000 +0000
+++ nats/gem_home/ruby/3.2.0/gems/bosh-nats-sync-0.0.0/lib/nats_sync/nats_auth_config.new.rb 2023-11-07 12:09:33.345254409 +0000
@@ -30,7 +30,7 @@
def agent_user(agent_id, cn)
{
- 'user' => "C=USA, O=Cloud Foundry, CN=#{cn}.agent.bosh-internal",
+ 'user' => "O=Cloud Foundry, CN=#{cn}.agent.bosh-internal",
'permissions' => {
'publish' => [
"hm.agent.heartbeat.#{agent_id}",
^D
Before running monit restart bosh_nats_sync, one can inspect a NATS client certificate:
$ bosh ssh scratchpad/0 -d scratchpad
$ sudo apt update -qq && sudo apt install -y -qq jq
$ sudo jq -r .env.bosh.mbus.cert.certificate /var/vcap/bosh/settings.json | openssl x509 -noout -subject
subject=C = USA, O = Cloud Foundry, CN = 09b9b58e-96ad-4f9f-b0dd-4e078b23ea9e.bootstrap.agent.bosh-internal
$ exit
On the Director, restart the director and bosh_nats_sync monit processes:
# monit restart director ; monit restart bosh_nats_sync ; watch -n1 monit summary # wait until both have restarted
The regenerated NATS config in /var/vcap/data/nats/auth.json makes all agents suddenly unresponsive.
Try re-creating a VM.
$ bosh recreate --fix scratchpad/0 -d scratchpad --non-interactive
...
$ bosh ssh scratchpad/0 -d scratchpad
$ sudo apt update -qq && sudo apt install -y -qq jq
$ sudo jq -r .env.bosh.mbus.cert.certificate /var/vcap/bosh/settings.json | openssl x509 -noout -subject
subject=O = Cloud Foundry, CN = 67aead65-2985-43cb-ab63-191df827c373.bootstrap.agent.bosh-internal
$ exit
It works without the countryName (C) RDN.
The Director can get back to its former state by applying the reversed patches with patch -p1 -R.
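For reference, the jq and openssl inspection used above can also be sketched in Ruby. The JSON path env.bosh.mbus.cert.certificate and the settings file location come from the commands above; the helper name `agent_cert_subject` is hypothetical.

```ruby
require "json"
require "openssl"

# Hypothetical helper: extracts the NATS client certificate from the agent
# settings JSON and returns its subject as a string.
def agent_cert_subject(settings_json)
  pem = JSON.parse(settings_json).dig("env", "bosh", "mbus", "cert", "certificate")
  OpenSSL::X509::Certificate.new(pem).subject.to_s
end

# Only meaningful on a BOSH-managed VM, and the file usually requires root.
settings_path = "/var/vcap/bosh/settings.json"
puts agent_cert_subject(File.read(settings_path)) if File.readable?(settings_path)
```

Note that the subject is printed in the slash-separated form (for example /O=Cloud Foundry/CN=...), not in the comma-separated form shown by openssl x509 -noout -subject.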
Thanks a lot @bgandon, @rkoster and @jpalermo for your work on this issue, and sorry we were unable to participate in the related discussions in the infrastructure working group meeting.
Within Orange, your fix will enable the bosh directors created using bosh create-env to have valid x509 certs with 2-letter country codes.
From @jpalermo's analysis in https://github.com/cloudfoundry/config-server/pull/17#pullrequestreview-1733458704:
I believe we decided there was no risk to these changes.
This code is pulled in by the bosh-cli and it will change how certificate variables are generated when doing a bosh create-env, but that will at most impact the CA/server cert generated for NATS, not any of the client certs, and the clients don't use the country for any sort of server validation.
I understand that this change will not cause agents to become unresponsive on these upgraded directors.
At Orange, the bosh directors not deployed using bosh create-env (i.e. "nested bosh directors") already use certificates patched to not include the country code in their subject. This enables operators to choose, in an emergency, to renew the certificates using the openssl CLI (without rotating the private key, just extending the expiration date) without going through the full procedure documented at https://bosh.io/docs/nats-ca-rotation/ which requires two redeployments of each bosh deployment. This method, while less secure than changing the private key, enables us to avoid hitting the expired cert condition, especially on directors with a large number of deployments and VMs, where recovery through deployment recreation has too heavy an operational impact.
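The "same key, later expiration date" renewal can be illustrated with a short Ruby sketch. This is not the openssl CLI procedure used at Orange, only an equivalent illustration with Ruby's OpenSSL bindings; the helper name `renew_certificate` is hypothetical.

```ruby
require "openssl"

# Hypothetical helper: re-issues `cert` with the same subject, public key and
# extensions, changing only the expiration date and the serial number.
# `signing_key` is the private key of the issuer (the cert's own key when the
# certificate is self-signed).
def renew_certificate(cert, signing_key, extra_seconds)
  renewed = OpenSSL::X509::Certificate.new(cert.to_der) # independent copy
  renewed.serial = cert.serial.to_i + 1
  renewed.not_after = cert.not_after + extra_seconds
  renewed.sign(signing_key, OpenSSL::Digest.new("SHA256"))
  renewed
end

# Example with a throwaway self-signed certificate.
key = OpenSSL::PKey::RSA.new(2048)
cert = OpenSSL::X509::Certificate.new
cert.version = 2
cert.serial = 1
cert.subject = OpenSSL::X509::Name.parse("/O=Cloud Foundry/CN=example-ca")
cert.issuer = cert.subject
cert.public_key = key.public_key
cert.not_before = Time.now - 60
cert.not_after = Time.now + 3600
cert.sign(key, OpenSSL::Digest.new("SHA256"))

renewed = renew_certificate(cert, key, 365 * 24 * 3600)
puts renewed.subject
puts renewed.not_after > cert.not_after
```

Because the key pair is unchanged, certificates already issued by this CA keep chaining to the renewed one, which is what makes the shortcut attractive in an emergency and also why it is weaker than a full rotation.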
Expected behavior
As a bosh user, in order to work with certificates generated by bosh interpolate, I need the certificates to be compliant with the specs, where the country code should be 2 letters.
From RFC 2459 page 73 (and also page 98):
Observed behavior in bosh variables
Bosh config server https://bosh.io/docs/director-certs/ creates certificates where the country code is USA (3 letters) and thus invalid.
https://github.com/cloudfoundry/config-server/blob/1133d48ad7894760d86dc59c4bcdf02d20541870/types/certificate_generator.go#L187-L199
As a result, tools such as openssl improperly handle them, in particular when computing their Subject key identifier from their Subject: the invalid Country=USA (3 letters) is excluded. This prevents regenerating new certs with new expiration dates using openssl.
Note: I'm not yet clear on how the invalid credhub certificate request is accepted by credhub, which explicitly rejects country codes that are not 2 characters long. Possibly this check was introduced in a recent credhub version not yet used by the bosh deployment?
https://github.com/cloudfoundry/credhub/blob/10f0365b913799e9fa931ddf438926bb3edc569c/components/credentials/src/main/kotlin/org/cloudfoundry/credhub/requests/CertificateGenerationRequestParameters.kt#L136
Proposed fix in bosh variables
Add support for specifying the country code in the variables option https://bosh.io/docs/director-certs/ so as to enable opt-in for the valid C=US instead of the invalid C=USA
https://github.com/cloudfoundry/bosh-cli/blob/1a5b8fa77d38050e89b137f834760213ce04312c/vendor/github.com/cloudfoundry/config-server/types/certificate_generator.go#L28-L37
/CC @ogrand
Observed behavior in bosh nats certs
NATS x509 certs also seem to have the invalid 3-letter USA country code in their Subject
https://github.com/cloudfoundry/bosh/blob/e4a7ff6cdfedbd7d35a411f52ff7ce08acae3214/src/bosh-director/lib/bosh/director/nats_client_cert_generator.rb#L38-L38