Open jhart1685 opened 3 years ago
To expand a bit on the last section in the description above:
I believe that `ImageReference.NewImageSource()` (in the containers library) returns a plain, untyped error for both the "invalid creds" case and the "connection error" case, so callers cannot distinguish the two.
With the code as currently structured, it is difficult to see how to harden it against network glitches without also adding an unwanted retry to every attempt made with wrong/invalid credentials.
Essentially, a temporary/transient failure in, say, registry OAuth becomes a permanent failure blocking a deployment. If the deployment had been admitted (e.g. with Portieris not installed), the transient error would be retried by the container runtime and eventually succeed. While we should look to mitigate short transient failures, ultimately the admission webhook will time out and the deployment will be rejected; we may be able to indicate the transient failure on the webhook response, but the deployment process would have to handle and retry it too.
Let's see how the above issue is received and decide whether to PR containers/image or to add some string matching (which could be unit tested for safety).
The containers/image library, or at least the parts we rely on here, appears to wrap errors extensively. If we also adopt error wrapping properly, then we can choose the appropriate place to perform retries, and do so when an error in the "stack" of wrapped errors implements `net.Error` and is `.Temporary()`.
Though `EOF` and connection refused are not considered Temporary, and a 5XX response will not implement `net.Error`.
So I think there are four possible avenues here:

1. Re-try when an error in the stack of wrapped errors implements `net.Error` and is `.Temporary()`; as noted above, this misses `EOF`, connection refused, and 5XX responses.
2. Re-try on any error that implements `net.Error`, regardless of whether it is temporary. This could be too wide, as it might catch other errors that unintentionally implement the interface. I don't consider string matching on the errors to be an acceptable approach. There is also this library, which makes some attempt to decide if errors are re-tryable: https://github.com/containers/common/tree/master/pkg/retry
3. PR the containers/image library to allow us to wrap the `http.RoundTripper` used by `github.com/containers/image/docker.dockerClient`, in a similar way to what is offered by `k8s.io/client-go/rest.Config.WrapTransport`. This would allow us to re-try on any error that comes back from the actual attempt to perform the request. I'm reaching out to the maintainers to discuss how such an approach could work, but I'm not confident it can be done cleanly.
4. Do nothing: Portieris is only ever invoked via `kubectl apply` or `helm`, or somebody's automation, so we always have someone to tell about our errors. Containerd must re-try because it is a daemon; it must retry to serve its base function.

After extensive testing I have been unable to provoke any retry-able errors talking to a registry. I do not believe it is worth the complexity to add a re-try and think this issue should be closed.
While working on this PR I found some related code bugs and have opened #273 to address those.
What commit ID of Portieris did you experience the problem with?
v9.5.0
What went wrong?
Portieris blocked deployment of an image with a simple signing policy. Logs showed:
The failing GET is retrieving a manifest from the registry.
What should have happened differently?
EOFs and connection timeouts can occur for a variety of reasons, both at the registry end and at the calling client (Portieris in this case). Most clients are hardened to retry in these kinds of cases, to provide resiliency against transient issues. I would expect Portieris to attempt at least one retry here before blocking the deployment.
How can it be reproduced?
There isn't a specific reproduction; it will occur occasionally over time.
Any other relevant information
Care is needed to ensure retries are not attempted for all failures. The code in question is inside a loop that can try a variety of different credentials. Obviously a 403 does not need to be retried.