Venafi / vcert

Go client SDK and command line utility designed to simplify integrations by automating key generation and certificate enrollment using Venafi machine identity services.
https://support.venafi.com/hc/en-us/articles/217991528
Apache License 2.0

VCert's "auto-retry" feature (i.e., resetting the certificate if it has failed) causes a race condition in TPP, resulting in the error "unmatched key modulus" #273

Closed maelvls closed 1 year ago

maelvls commented 1 year ago

When using VCert v4.23.0, with TPP 22.4.1 (also tested with TPP 20.4.0), I often receive the following error message when requesting a certificate:

vcert error: your data contains problems: request doesn't match certificate: unmatched key modulus

I checked that the problem does not come from re-using the private key. I can confirm that the CSR and the issued certificate are mismatched.
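
One way to confirm such a mismatch independently of VCert is to compare the public key embedded in the CSR with the one in the returned certificate. The following is a minimal sketch using only the Go standard library; it is not VCert's own check, and the names csrMatchesCert, genKey, newCSR, newCert and demo are hypothetical helpers introduced for illustration. The demo builds throwaway, self-signed material rather than talking to TPP.

```go
package main

import (
	"bytes"
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// csrMatchesCert reports whether a PEM-encoded CSR and a PEM-encoded
// certificate carry the same public key (hypothetical helper, not VCert API).
func csrMatchesCert(csrPEM, certPEM []byte) (bool, error) {
	csrBlock, _ := pem.Decode(csrPEM)
	certBlock, _ := pem.Decode(certPEM)
	if csrBlock == nil || certBlock == nil {
		return false, errors.New("input does not contain a PEM block")
	}
	csr, err := x509.ParseCertificateRequest(csrBlock.Bytes)
	if err != nil {
		return false, err
	}
	cert, err := x509.ParseCertificate(certBlock.Bytes)
	if err != nil {
		return false, err
	}
	// Comparing the DER-encoded SubjectPublicKeyInfo covers the RSA modulus.
	return bytes.Equal(csr.RawSubjectPublicKeyInfo, cert.RawSubjectPublicKeyInfo), nil
}

// The helpers below only build throwaway test material and panic on failure.
func genKey() *rsa.PrivateKey {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}
	return key
}

func newCSR(key *rsa.PrivateKey) []byte {
	tmpl := &x509.CertificateRequest{Subject: pkix.Name{CommonName: "app1.example.com"}}
	der, err := x509.CreateCertificateRequest(rand.Reader, tmpl, key)
	if err != nil {
		panic(err)
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: der})
}

func newCert(key *rsa.PrivateKey) []byte {
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "app1.example.com"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
	}
	// Self-signed: the template is its own parent.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
}

// demo checks one CSR against a certificate issued for the same key and
// against a certificate for another key (standing in for the stale one).
func demo() (bool, bool, error) {
	currentKey, previousKey := genKey(), genKey()
	csr := newCSR(currentKey)
	freshOK, err := csrMatchesCert(csr, newCert(currentKey))
	if err != nil {
		return false, false, err
	}
	staleOK, err := csrMatchesCert(csr, newCert(previousKey))
	return freshOK, staleOK, err
}

func main() {
	freshOK, staleOK, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("fresh certificate matches CSR:", freshOK)
	fmt.Println("stale certificate matches CSR:", staleOK)
}
```

With real data, csrMatchesCert would be given the CSR that was posted and the certificate that retrieve returned; a false result corresponds to the situation described in this issue.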

The bug

Affected: TPP 22.4.1 and older, VCert 4.23.0, cert-manager 1.11.0 (just this one version), venafi-enhanced-issuer v0.2.0 and v0.3.0.

Fixed In: VCert v4.24.0, cert-manager v1.11.1 and 1.12.0, venafi-enhanced-issuer v0.3.1.

This bug happens systematically under the following circumstances:

  1. It only happens during renewal (not during the initial enrollment).
  2. It only happens after a first renewal attempt has failed (e.g., the CA was down).
  3. It only happens if the second renewal attempt also fails (e.g., the CA was still down).

In real-world usage, that means that the workaround for "Click retry" introduced in #269 only "works" 50% of the time. This is better than before #269, since you were getting stuck with "Click retry" 100% of the time, but the error is now less descriptive.

Workaround: renew the certificate once again (given that this third attempt succeeds; otherwise, the fourth attempt will also fail, and so on).

When this bug occurs, VCert and cert-manager will show the following message:

request doesn't match certificate: unmatched key modulus

Problem

This unexpected behavior seems to happen when request and reset(restart=true) are called back to back. When that happens, TPP gives VCert an old certificate instead of returning a 500 error.

There seems to be a bad interaction between request and reset(restart=true). Note that the request calls made by VCert are asynchronous (WorkToDoTimeout=0), so request never returns a 500. Only retrieve calls may return a 500.

The following flow triggers the problem:

# Fresh certificate, CA is down.
request
reset(restart=true)
retrieve
# ❌ Returns 200 with the old certificate.

Even with a 1 second pause between requesting and resetting, the problem still occurs:

# Fresh certificate, CA is down.
request
sleep 1s
reset(restart=true)
retrieve
# ❌ Returns 200 with the old certificate.

We found that waiting for 5 seconds allows you to work around the problem. We also found that using reset(restart=false) before requesting doesn't trigger the problem.

# Fresh certificate, CA is down.
request
sleep 5s
reset(restart=true)
retrieve
# ✅ Returns 500 as expected.

# Fresh certificate, CA is down.
reset(restart=false)
request
retrieve
# ✅ The expected 500 HTTP code is returned.
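
The last ordering above (reset(restart=false), then request, then retrieve) can be sketched in Go as follows. This only illustrates the call order reported as safe in this comment; it is not VCert's implementation. The tppCalls interface, renewResetFirst and callRecorder are hypothetical names, and the recorder stands in for real HTTP calls. Handling of the initial enrollment, where the certificate object may not exist yet, is left out.

```go
package main

import "fmt"

// tppCalls is a hypothetical abstraction over the three TPP operations
// discussed in this issue; it is not part of VCert.
type tppCalls interface {
	Request(certDN string) error
	Reset(certDN string, restart bool) error
	Retrieve(certDN string) error
}

// renewResetFirst resets with restart=false BEFORE requesting, instead of
// calling reset(restart=true) right after request.
func renewResetFirst(c tppCalls, certDN string) error {
	if err := c.Reset(certDN, false); err != nil {
		return fmt.Errorf("reset: %w", err)
	}
	if err := c.Request(certDN); err != nil {
		return fmt.Errorf("request: %w", err)
	}
	if err := c.Retrieve(certDN); err != nil {
		return fmt.Errorf("retrieve: %w", err)
	}
	return nil
}

// callRecorder is a stand-in client that only records the order of calls.
type callRecorder struct{ calls []string }

func (r *callRecorder) Request(string) error {
	r.calls = append(r.calls, "request")
	return nil
}

func (r *callRecorder) Reset(_ string, restart bool) error {
	r.calls = append(r.calls, fmt.Sprintf("reset(restart=%t)", restart))
	return nil
}

func (r *callRecorder) Retrieve(string) error {
	r.calls = append(r.calls, "retrieve")
	return nil
}

func main() {
	rec := &callRecorder{}
	if err := renewResetFirst(rec, `\VED\Policy\application-team-1\app1.example.com`); err != nil {
		panic(err)
	}
	fmt.Println(rec.calls) // prints "[reset(restart=false) request retrieve]"
}
```

Putting the reset first means no reset happens between request and retrieve, which is the window in which the stale certificate was observed.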

Reproducing "unmatched key modulus" with vcert

First, set your .envrc:

#!/bin/bash
export TPP_URL=https://tpp.mael-valais-gcp.jetstacker.net
export TPP_USER=cert_manager
export TPP_PWD=$(lpass show -p tpp.mael-valais-gcp.jetstacker.net)
export TPP_CLIENT_ID=edit-mappings

Then, get a token:

TOKEN=$(vcert getcred -u $TPP_URL --username=$TPP_USER --password $TPP_PWD --client-id=$TPP_CLIENT_ID --scope=certificate:manage,revoke,delete --format json | tee /dev/stderr | jq -r .access_token) && export TOKEN

Then, make sure that the certificate doesn't already exist:

curl -D/dev/null -skSH "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -X DELETE $TPP_URL/vedsdk/certificates/$(curl -D/dev/null -skSH "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" $TPP_URL/vedsdk/config/dntoguid -d '{"ObjectDN":"\\VED\\Policy\\application-team-1\\app1.example.com"}' | tee /dev/stderr | jq .GUID -r | tr -d '{}')

Then, make sure that the AD CS service (i.e., Microsoft Certification Authority) is running:

gcloud compute ssh --project jetstack-mael-valais --zone europe-west1-c cert_manager@tpp -- net start certsvc

Then, enroll a certificate. It should succeed:

vcert enroll -u https://tpp.mael-valais-gcp.jetstacker.net -t "$TOKEN" --cn app1.example.com -z 'application-team-1' --san-dns=app1.example.com

Then, turn the CA off:

gcloud compute ssh --project jetstack-mael-valais --zone europe-west1-c cert_manager@tpp -- net stop certsvc

Then, enroll. It should show an error:

$ vcert enroll -u https://tpp.mael-valais-gcp.jetstacker.net -t "$TOKEN" --cn app1.example.com -z 'application-team-1' --san-dns=app1.example.com

Enter key passphrase:
Verifying - Enter key passphrase:
vCert: 2023/01/20 15:25:36 Successfully connected to Trust Protection Platform
vCert: 2023/01/20 15:25:36 Successfully read zone configuration for application-team-1
vCert: 2023/01/20 15:25:37 Successfully created request for app1.example.com
vCert: 2023/01/20 15:25:37 Successfully posted request for app1.example.com, will pick up by \VED\Policy\application-team-1\app1.example.com
vCert: 2023/01/20 15:25:37 unable to retrieve: Unexpected status code on TPP Certificate Retrieval. Status: 500 Certificate \VED\Policy\application-team-1\app1.example.com has encountered an error while processing, Status: Post CSR failed with error: Cannot connect to the certificate authority (CA). Verify that your CA template settings are correct and that the remote server is available. For more information, search the Help system for Configuring the Microsoft Certificate Services Template Object., Stage: 500.

If you do it again, you will get the key modulus error:

$ vcert enroll -u https://tpp.mael-valais-gcp.jetstacker.net -t "$TOKEN" --cn app1.example.com -z 'application-team-1' --san-dns=app1.example.com

Enter key passphrase:
Verifying - Enter key passphrase:
vCert: 2023/01/20 15:26:51 Successfully connected to Trust Protection Platform
vCert: 2023/01/20 15:26:51 Successfully read zone configuration for application-team-1
vCert: 2023/01/20 15:26:51 Successfully created request for app1.example.com
vCert: 2023/01/20 15:26:52 Successfully posted request for app1.example.com, will pick up by \VED\Policy\application-team-1\app1.example.com
vCert: 2023/01/20 15:26:54 vcert error: your data contains problems: request doesn't match certificate: unmatched key modulus

vcert-request-reset-incorrect.har.zip

If you do it a third time, it will show the correct 500 error, since reset won't be called:

$ vcert enroll -u https://tpp.mael-valais-gcp.jetstacker.net -t "$TOKEN" --cn app1.example.com -z 'application-team-1' --san-dns=app1.example.com

Enter key passphrase:
Verifying - Enter key passphrase:
vCert: 2023/01/20 15:25:36 Successfully connected to Trust Protection Platform
vCert: 2023/01/20 15:25:36 Successfully read zone configuration for application-team-1
vCert: 2023/01/20 15:25:37 Successfully created request for app1.example.com
vCert: 2023/01/20 15:25:37 Successfully posted request for app1.example.com, will pick up by \VED\Policy\application-team-1\app1.example.com
vCert: 2023/01/20 15:25:37 unable to retrieve: Unexpected status code on TPP Certificate Retrieval. Status: 500 Certificate \VED\Policy\application-team-1\app1.example.com has encountered an error while processing, Status: Post CSR failed with error: Cannot connect to the certificate authority (CA). Verify that your CA template settings are correct and that the remote server is available. For more information, search the Help system for Configuring the Microsoft Certificate Services Template Object., Stage: 500.

If you call it a fourth time, it will show "unmatched key modulus" again, and so on and so forth.

Reproducing with curl

❌ Renewal of OK certificate. CA is down. Flow: request > reset(restart=true)

Occurrence of this scenario in VCert: 100% of the time under the following circumstances:

  1. It only happens during renewal (not during the initial enrollment).
  2. It only happens after a first renewal attempt has failed.
  3. It only happens if the second renewal attempt also fails.

There is an easy workaround: re-renewing the certificate (given that this third attempt succeeds; otherwise, the fourth attempt will also fail, and so on).

Before running this test, I turn the CA on, issue a certificate, and turn the CA off:

curl -D/dev/null -skSH "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -X DELETE $TPP_URL/vedsdk/certificates/$(curl -D/dev/null -skSH "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" $TPP_URL/vedsdk/config/dntoguid -d '{"ObjectDN":"\\VED\\Policy\\application-team-1\\app1.example.com"}' | tee /dev/stderr | jq .GUID -r | tr -d '{}')
gcloud compute ssh --project jetstack-mael-valais --zone europe-west1-c cert_manager@tpp -- net start certsvc
vcert enroll -u $TPP_URL -t "$TOKEN" --cn app1.example.com -z 'application-team-1' --san-dns=app1.example.com
gcloud compute ssh --project jetstack-mael-valais --zone europe-west1-c cert_manager@tpp -- net stop certsvc

The actual test:

#!/bin/bash
openssl genrsa -out crt.key 2048
curl -X POST https://tpp.mael-valais-gcp.jetstacker.net/vedsdk/certificates/request -w ' %{http_code}\n' -skS -D/dev/null -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -d@- <<EOF
{
  "PolicyDN": "\\\\VED\\\\Policy\\\\application-team-1",
  "CASpecificAttributes": [{ "Name": "Origin", "Value": "curl" }],
  "Origin": "curl",
  "PKCS10": $(step certificate create app1.example.com --san app1.example.com --csr --key crt.key /dev/stdout -f | jq -R --slurp),
  "KeyAlgorithm": "RSA",
  "KeyBitSize": 2048,
  "DisableAutomaticRenewal": true,
  "CADN":"\\\\VED\\\\Policy\\\\Administration\\\\msca"
}
EOF
curl -sS -D/dev/null -skSH "Authorization: Bearer $TOKEN" -w ' %{http_code}\n' -H "Content-Type: application/json" -o/dev/stdout https://tpp.mael-valais-gcp.jetstacker.net/vedsdk/certificates/reset -d '{"CertificateDN":"\\VED\\Policy\\application-team-1\\app1.example.com", "Restart":true}'
while :; do curl -sS -D/dev/null -w ' %{http_code}\n' -skSH "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -o/dev/stdout https://tpp.mael-valais-gcp.jetstacker.net/vedsdk/certificates/retrieve -d '{"CertificateDN":"\\VED\\Policy\\application-team-1\\app1.example.com", "Format": "base64", "IncludePrivateKey": false}'; sleep 1; done

For some reason, the retrieve call returns 200 OK instead of 500 Internal Server Error, and the returned certificate doesn't match the CSR. The certificate corresponds to the old certificate that was meant to be renewed:

{"CertificateDN":"\\VED\\Policy\\application-team-1\\app1.example.com","Guid":"{eea3a1db-0602-4be3-9a88-da1be4dcc855}"} 200
{"ProcessingResetCompleted":true,"RestartCompleted":true} 200
{"Stage":-1,"Status":"Queued for renewal"} 202
{"Stage":-1,"Status":"Queued for renewal"} 202
{"CertificateData":"LS0tLS1CRUdJTi...URS0tLS0tDQo=","Filename":"app1.example.com.cer","Format":"base64"} 200
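
The retrieve responses shown in this issue fall into three cases: 202 while the request is queued, 200 with a base64 CertificateData field, and 500 with a Stage and Status describing the failure. A minimal Go sketch of how a client might interpret them is shown below. The retrieveBody type and interpretRetrieve function are hypothetical and cover only the fields visible in these outputs; the actual TPP response may contain more.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// retrieveBody holds only the fields visible in the outputs of this issue.
type retrieveBody struct {
	CertificateData string `json:"CertificateData"`
	Stage           int    `json:"Stage"`
	Status          string `json:"Status"`
}

// interpretRetrieve maps one retrieve response to (certificate, pending, error).
// It is a hypothetical helper, not part of VCert.
func interpretRetrieve(httpStatus int, body []byte) ([]byte, bool, error) {
	var b retrieveBody
	if err := json.Unmarshal(body, &b); err != nil {
		return nil, false, fmt.Errorf("decoding body: %w", err)
	}
	switch httpStatus {
	case 200:
		// Format "base64" was requested, so CertificateData is base64 text.
		cert, err := base64.StdEncoding.DecodeString(b.CertificateData)
		if err != nil {
			return nil, false, fmt.Errorf("decoding CertificateData: %w", err)
		}
		return cert, false, nil
	case 202:
		// Still queued: the caller should poll again later.
		return nil, true, nil
	default:
		return nil, false, fmt.Errorf("retrieve failed (HTTP %d, stage %d): %s",
			httpStatus, b.Stage, b.Status)
	}
}

func main() {
	_, pending, err := interpretRetrieve(202, []byte(`{"Stage":-1,"Status":"Queued for renewal"}`))
	fmt.Println("pending:", pending, "error:", err)

	_, _, err = interpretRetrieve(500, []byte(`{"Stage":500,"Status":"Post CSR failed"}`))
	fmt.Println("error:", err)
}
```

As this issue shows, a 200 alone does not prove that the returned certificate is the renewed one, so the decoded certificate still needs to be compared with the CSR that was submitted.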

✅ Renewal of OK certificate. CA is down. Flow: request > wait 5s > reset(restart=true)

Before running this test, I turn the CA on, issue a certificate, and turn the CA off:

curl -D/dev/null -skSH "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -X DELETE $TPP_URL/vedsdk/certificates/$(curl -D/dev/null -skSH "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" $TPP_URL/vedsdk/config/dntoguid -d '{"ObjectDN":"\\VED\\Policy\\application-team-1\\app1.example.com"}' | tee /dev/stderr | jq .GUID -r | tr -d '{}')
gcloud compute ssh --project jetstack-mael-valais --zone europe-west1-c cert_manager@tpp -- net start certsvc
vcert enroll -u $TPP_URL -t "$TOKEN" --cn app1.example.com -z 'application-team-1' --san-dns=app1.example.com
gcloud compute ssh --project jetstack-mael-valais --zone europe-west1-c cert_manager@tpp -- net stop certsvc

Here is the actual test:

#!/bin/bash
openssl genrsa -out crt.key 2048
curl -X POST https://tpp.mael-valais-gcp.jetstacker.net/vedsdk/certificates/request -w ' %{http_code}\n' -skS -D/dev/null -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -d@- <<EOF
{
  "PolicyDN": "\\\\VED\\\\Policy\\\\application-team-1",
  "CASpecificAttributes": [{ "Name": "Origin", "Value": "curl" }],
  "Origin": "curl",
  "PKCS10": $(step certificate create app1.example.com --san app1.example.com --csr --key crt.key /dev/stdout -f | jq -R --slurp),
  "KeyAlgorithm": "RSA",
  "KeyBitSize": 2048,
  "DisableAutomaticRenewal": true,
  "CADN":"\\\\VED\\\\Policy\\\\Administration\\\\msca"
}
EOF
sleep 5
curl -sS -D/dev/null -skSH "Authorization: Bearer $TOKEN" -w ' %{http_code}\n' -H "Content-Type: application/json" -o/dev/stdout https://tpp.mael-valais-gcp.jetstacker.net/vedsdk/certificates/reset -d '{"CertificateDN":"\\VED\\Policy\\application-team-1\\app1.example.com", "Restart":true}'
while :; do curl -sS -D/dev/null -w ' %{http_code}\n' -skSH "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -o/dev/stdout https://tpp.mael-valais-gcp.jetstacker.net/vedsdk/certificates/retrieve -d '{"CertificateDN":"\\VED\\Policy\\application-team-1\\app1.example.com", "Format": "base64", "IncludePrivateKey": false}'; sleep 1; done

It errors as expected:

{"CertificateDN":"\\VED\\Policy\\application-team-1\\app1.example.com","Guid":"{5333ece9-50a6-474c-9c12-ecaef72868d7}"} 200
{"ProcessingResetCompleted":true,"RestartCompleted":true} 200
{"Stage":500,"Status":"Post CSR failed with error: Cannot connect to the certificate authority (CA). Verify that your CA template settings are correct and that the remote server is available. For more information, search the Help system for Configuring the Microsoft Certificate Services Template Object."} 500

achuchev commented 1 year ago

We found an issue on the Venafi TLS Protect DC side. We are checking whether there is an acceptable workaround that the VCert SDK can use.

luispresuelVenafi commented 1 year ago

I'll change the title to make it more suitable. The current title quotes the message we print whenever a platform returns a certificate that does not match the private key. We expect this error whenever something is wrong with the returned certificate and key pair, so having it as the title doesn't highlight what the actual issue is.

luispresuelVenafi commented 1 year ago

Addressed by the revert done in release v4.24.0.

maelvls commented 1 year ago

Just to clarify, cert-manager 1.11 and 1.12 do not currently use the upstream version of VCert. They rely on a fork (jetstack/vcert) that contains this fix, which adds the "reset" operation as part of cert-manager's Venafi issuances.

Note that this fork is no longer used since cert-manager 1.13.

Here is the release note that was published in cert-manager 1.11.1 and 1.12.0 and in venafi-enhanced-issuer v0.3.1:

The auto-retry mechanism added in VCert 4.23.0 and part of cert-manager 1.11.0 (https://github.com/cert-manager/cert-manager/pull/5674) and venafi-enhanced-issuer v0.2.0 and v0.3.0 has been found to be faulty.

Until this issue is fixed upstream, we now use a patched version of VCert. This patch will slow down the issuance of certificates by 9% in case of heavy load on TPP. We aim to release a patch release of cert-manager at a later date to fix this slowdown.