fluxcd / flux

Successor: https://github.com/fluxcd/flux2
https://fluxcd.io
Apache License 2.0

fix(repocachemanager): adding rate limit condition message #3543

Closed · fdpeiter closed this 3 years ago

fdpeiter commented 3 years ago

Signed-off-by: Felipe Peiter felipe.peiter@wildlifestudios.com

Recently we've started seeing several messages related to the Docker Hub pull rate limit in our clusters:

{"auth":"{map[]}","caller":"repocachemanager.go:226","canonical_name":"index.docker.io/connecteverything/nats-operator","component":"warmer","err":{"errors":[{"code":"TOOMANYREQUESTS","message":"You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit"}]},"ref":"docker.io/connecteverything/nats-operator:0.8.0","ts":"2021-08-17T15:50:10.734988918Z"}

This is unexpected, as there is a specific switch case for handling rate-limiting here. I checked the content of err.Error() and got this: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit.

I've added this error message as a new condition for the switch case. I've tested it already and got the expected behaviour:

{"auth":"{map[]}","caller":"repocachemanager.go:216","canonical_name":"index.docker.io/tfgco/will-iam","component":"warmer","ts":"2021-08-20T17:20:46.220746763Z","warn":"aborting image tag fetching due to rate limiting, will try again later"}
kingdonb commented 3 years ago

Thanks for making a contribution here! We can certainly get this merged, and CI will push out a pre-release image with the change. It may take some time to get another release scheduled, as we have just had a new minor release.

Can you say something about the impact of introducing this change on your cluster? It's not easy for me to follow what's going on in the code here. On the latest version of Flux, v1.24.0, I understand that these rate-limiting messages would be misclassified as a generic error, so I imagine none of the rate-limiting mitigation (the backoff) would ever be triggered, but I'm not really sure what the direct impact or significance of warnAboutRateLimit is.

Is there any actual, quantified performance impact from traveling down the wrong code path here? From what you've described, I'm assuming you've spent some time testing this change out locally. (Or does it simply correct bad behavior in the logs?)

fdpeiter commented 3 years ago

Hello there, @kingdonb. Thanks for the quick response!

I actually did get the chance to dig a little into this. After implementing the change, it seems that only a single call to update the images is made, which greatly lowers the number of pulls on Docker Hub: [screenshot: Screen Shot 2021-08-23 at 13 11 22]

This graph shows how it was before, while hitting the TOOMANYREQUESTS error: [screenshot: Screen Shot 2021-08-23 at 13 13 48]

This particular application only fetches a single image from Docker Hub, but the behavior was even more visible in applications without registry scan restrictions.

I do suspect that the default switch case doesn't cancel the scan after a failure, while the TOOMANYREQUESTS case does.
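
For illustration, a minimal sketch of why that difference matters, assuming the scan uses a cancellable context as suggested by the discussion (fetchTags and the tag names here are hypothetical, not the actual Flux code): once the rate-limit branch calls cancel(), the remaining per-tag fetches for that cycle are skipped, whereas the default branch lets the loop keep hitting the registry.

```go
package main

import (
	"context"
	"fmt"
)

// fetchTags simulates one scan cycle: it stops issuing registry calls as soon
// as the shared context has been cancelled by the rate-limit branch.
func fetchTags(ctx context.Context, tags []string, cancel context.CancelFunc) int {
	pulls := 0
	for _, tag := range tags {
		select {
		case <-ctx.Done():
			// Scan aborted: no further registry calls this cycle.
			return pulls
		default:
		}
		pulls++
		fmt.Println("fetched manifest for tag", tag)
		if tag == "rate-limited" { // pretend this fetch hit "toomanyrequests"
			cancel() // the rate-limit case cancels; the default case would not
		}
	}
	return pulls
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	tags := []string{"0.8.0", "rate-limited", "0.8.1", "0.8.2", "0.8.3"}
	fmt.Println("registry calls made:", fetchTags(ctx, tags, cancel)) // 2, not 5
}
```
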

Please let me know if you need more information.

kingdonb commented 3 years ago

Thanks very much, that's exactly the kind of color I wanted to add to this report before merging this. So, the performance is affected because Flux fails to back off properly when the "too many requests" exception isn't properly handled as a rate limit.

I'd like to go ahead and merge this as soon as possible; however, I'm considering leaving the PR open for a couple of days to see whether anyone can poke holes in it. I'd be interested to know whether this is just something new that Docker Hub has started, and whether some other image repo host might be sending back its 429 errors in a third, different form that doesn't match this one, in case there are more cases we need to cover here. If someone else reports that this issue is affecting them in the same way, we can merge it more quickly and get a release out soon.

I don't want to wait too long since many people could be affected by this issue if it impacts Docker Hub users, but if you have already remediated the issue in your own Flux deployments, I think I could delay a while to avoid spending a half day or full day promoting a release when we might have another PR opened tomorrow that could be merged in the same release. I will mark this for the next release milestone and get back to you soon about it.

Thanks again for contributing this! And sorry for the poor experience to anyone who finds this issue because they were impacted by it.

fdpeiter commented 3 years ago

> Thanks very much, that's exactly the kind of color I wanted to add to this report before merging this. So, the performance is affected because Flux fails to back off properly when the "too many requests" exception isn't properly handled as a rate limit.

That sounds correct, as Flux doesn't seem to fetch all images after being rate-limited.

> I don't want to wait too long since many people could be affected by this issue if it impacts Docker Hub users, but if you have already remediated the issue in your own Flux deployments, I think I could delay a while to avoid spending a half day or full day promoting a release when we might have another PR opened tomorrow that could be merged in the same release. I will mark this for the next release milestone and get back to you soon about it.

That makes a lot of sense! I will roll this out to 40+ clusters this week and can report back with any findings; no need to rush.

Thanks again!

kingdonb commented 3 years ago

#3544 is open, tracking the release.

I will probably wait for your report about how it went before I merge this back to the 1.24.x branch and push the release out.

Thanks for coordinating this with me!

fdpeiter commented 3 years ago

Hey there, @kingdonb!

Results seem great. For comparison:

[screenshot: Screen Shot 2021-08-25 at 12 50 00] shows the logs matching "aborting image tag fetching due to rate limiting, will try again later".

[screenshot: Screen Shot 2021-08-25 at 12 50 48] shows the logs for TOOMANYREQUESTS.

You can safely ignore those two peaks, as I did a memcached upgrade (which cleared the cache).