Simple signing verification sometimes fails due to transient/network errors

jhart1685 commented 3 years ago

What commit ID of Portieris did you experience the problem with?

v9.5.0

What went wrong?

Portieris blocked deployment of an image with a simple signing policy. Logs showed:

imagePolicy.go:63] SimpleSigning verification: ImagePullSecret with username <REDACTED> for image <REDACTED> failed, no more secrets in scope (secret 3/3). Failing. Error Get "<REDACTED>": EOF

The failing GET is retrieving a manifest from the registry.

What should have happened differently?

EOFs and connection timeouts can occur for a variety of reasons - both at the registry end and the calling client (Portieris in this case). Most clients are hardened to retry in these kind of cases, to provide resiliency against transient issues. I would expect Portieris to have at least one retry here before blocking the deployment.

How can it be reproduced?

There's not a specific reproduce, it will occur occasionally over time.

Any other relevant information

Care is needed to ensure retries are not attempted for all failures. The code in question is inside a loop that can try a variety of different credentials. Obviously a 403 does not need to be retried.

jhart1685 commented 3 years ago

To expand a bit on the last section in the description above..

I believe that ImageReferece.NewImageSource() (in the containers library) will return an error , which is not a useful "typed" error , for both the "invalid creds" case and the "connection error" case. With the code as currently structured, it is difficult to see how you can harden it against network glitches without also adding an unwanted retry to all attempts that use the wrong/invalid credentials.

sjhx commented 3 years ago

Essentially a temporary/transient failure in say registry oauth becomes a permanent failure blocking a deployment. If the deployment was admitted (e.g. portieris not installed) the transient error would be retried by the container runtime and eventually succeeed. While should look to mitigate short transient failure ultimately the admission webhook will timeout and the deployment should be rejected, it may be that we can indicate the transient failure on the webhook response but the deployment process would have to handle and retry too.

sjhx commented 3 years ago

Lets see how the above issue is received and decide whether to PR container/image or to add some string matching (which could be unit tested for safety).

brackendawson commented 3 years ago

The containers/image library, or at least the parts we rely on here appear to wrap errors extensively. If we also adopt error wrapping properly then we can choose the appropriate place to perform retries and do so if an error in the "stack" of wrapped errors implements net.Error and is .Temporary().

Though EOF and connection refused are not considered Temporary, and a 5XX response will not implement net.Error.

brackendawson commented 3 years ago

So I think there are four possible avenues here:

Retry all failed admissions n times. This would retry when we hit an EOF as above, but would also retry inappropriately for unauthorized etc. and would also mean we have to deal with how we present errors from multiple attempts to the user.
Retry typed errors as in my previous comment. This would re-try some errors, but not the EOF error that this issue was raised for. We could solve that by retrying any error that implements net.Error regardless of if it is temporary, but this could be too wide as it might catch other errors that unintentionally implement this interface. I don't consider string matching on the errors to be an acceptable approach. There is also this library which makes some attempt to decide if errors are re-tryable: https://github.com/containers/common/tree/master/pkg/retry
We could make an upstream change to the containers/image library to allow us to wrap the http.RoundTripper used by github.com/containers/image/docker.dockerClient in a similar way to what is offered by k8s.io/client-go/rest.Config.WrapTransport, this would allow us to re-try on any error that comes back from the actual attempt to perform the request. I'm reaching out to the maintainers to discuss how such an approach could work, but I'm not confident it can be done cleanly.
Not retry. A difference between portieris and the container runtime that would be talking to the registry after admission is that portieris will always have a "user", be that a human calling kubectl apply or helm, or somebody's automation. We always have someone to tell about our errors. Containerd must re-try because it is a daemon, it must retry to serve its base function.

brackendawson commented 3 years ago

After extensive testing I have been unable to provoke any retry-able errors talking to a registry. I do not believe it is worth the complexity to add a re-try and think this issue should be closed.

While working on this PR I found some related code bugs and have opened #273 to address those.

IBM / portieris