We likely want to do this for all terminal states (i.e. Failed, and perhaps unknown readiness conditions, though those should be impossible). We can write a configurable nextIssuanceTime to the metadata file, since both a failed and a denied request may be transient. A default of 1h may be appropriate (?).
wdyt?
Currently the denied request is handled separately from failed requests. I agree some terminal states are transient, like timeouts or manual approval; those could be retried with exponential back-off logic. But for invalid CSRs, retrying probably won't help.
Also, from what I have observed, the retries for denied requests do not have exponential back-off logic, maybe because they are re-created requests. Worth checking as well.
The issue with failing permanently is that we can't bubble this up to the Kubelet (which is invoking the CSI plugin), meaning that if a failure does occur then the pod will never start and will be 'stuck'.
The rationale behind re-creating the CertificateRequest is that a future CSR might be approved (even if it is being denied now). For example, if a user has deployed an always-deny approver, sets up the CSI driver and hits this case, they would then go and fix up their approval flow so that these CSRs do get approved in future, and the retrying behaviour means that the pod will then start.
We certainly should reduce how often we retry though - ideally we'd exponentially back off (but still continue to retry) and surface this back to the user by returning an error message in the Publish call.
To properly manage exponential backoff though, we will probably need to change how NodePublishVolume behaves on failure. We can probably re-use some of the codepaths we created for the "continue on not ready" feature to do this, but will need to also record the number of failed attempts in the metadata store too to properly compute the next "wait period".
We do already perform exponential back-off on retries, which you can see here: https://github.com/cert-manager/csi-lib/blob/c9135148ecc6a25820bae8ea25c1b6d631a90ca8/manager/manager.go#L498-L520
However, for the initial NodePublishVolume call we set a timeout of 30s for the initial issuance: https://github.com/cert-manager/csi-lib/blob/c9135148ecc6a25820bae8ea25c1b6d631a90ca8/driver/nodeserver.go#L51
This means that after 30s of waiting, the volume will be 'unmanaged' and this exponential back-off will be cancelled.
The kubelet will then retry the NodePublishVolume call after a few seconds, which won't start with the same exponent (meaning we'll see another flurry of requests).
This is definitely something we should dig into and improve upon... I'm going to look into what options we have to allow us to not lose this state and consistently apply a proper exponential back off.
/assign
Thinking some more on our exponential backoff config + the 30s context deadline in NodePublishVolume, assuming an approval plugin that always denies (or an issuer that always fails), we will create requests as so:
T=0s
T=2s
T=6s
T=14s
T=30s
(context deadline exceeded)
(kubelet calls NodePublishVolume again after N seconds; the exact number is uncertain, but assume 5s in this example)
T=35s
T=37s
T=41s
T=49s
T=65s
(context deadline exceeded)
...
That means that every 65s we are creating approximately 10 CertificateRequest objects.
We could consider a more aggressive back-off instead - perhaps not even exponential (say, one request per minute).
We can still specifically handle errors not caused by the approver or issuer (e.g. a transient network failure) and retry quicker in those cases, but for errors like you describe (approver denies or issuer fails), we should definitely create fewer CertificateRequest objects.
The rationale behind re-creating the CertificateRequest is that a future CSR might be approved (even if it is being denied now). For example, if a user has deployed an always-deny approver, sets up the CSI driver and hits this case, they would then go and fix up their approval flow so that these CSRs do get approved in future, and the retrying behaviour means that the pod will then start.
Agree with this reason. Retry with better exp-back-off logic would be good.
That means that every 65s we are creating approximately 10 CertificateRequest objects.
Confirmed. Our log shows a similar sequence (with 0.5 jitter) for "Triggering new issuance".
Because for each issue call, it will first delete the old CertificateRequest object, so by default only one CSR object exists at any time.
We can probably re-use some of the codepaths we created for the "continue on not ready" feature to do this, but will need to also record the number of failed attempts in the metadata store too to properly compute the next "wait period".
We could also utilize ctx.Value() (introduced in PR https://github.com/cert-manager/csi-lib/pull/32) to pass a longer wait time?
Verified with the fix: a denied CR will be retried in the following pattern:
T=0s
T=30s
T=60s
T=120s
T=240s
T=300s (5 minutes, the cap)
T=300s (5 minutes, the cap)
...
But I found another issue, tracked separately in https://github.com/cert-manager/csi-lib/issues/41. Closing this one.
Related code block: https://github.com/cert-manager/csi-lib/blob/main/manager/manager.go#L305-L309
Observed behavior: When a CertificateRequest is denied by an approver, the lib will delete it and recreate one. If the approver auto-approves/denies CSRs based on some policy, this creates an infinite loop: create -> denied -> delete -> create again -> ...
Sample logs:
Expected behavior: When a CertificateRequest is denied, it should be in a terminal state. No new object should be created, or at least there should be a flag to stop this infinite loop. A possible solution is to return true in this line: https://github.com/cert-manager/csi-lib/blob/main/manager/manager.go#L308
@munnerz