Further improve how we minimize the possibility of GitHub AbuseException

Jericho commented 3 years ago

We seem to be triggering GitHub's abuse detection more frequently as of late and the result is that our automated process raises an issue to add/modify/delete a given yaml file but we don't submit the corresponding PR.

It could be caused by another process running under the cake-contrib-bot user and making GitHub API calls around the same time the discoverer is scheduled (need to ask the @cake-contrib/cake-team if any new process is running), or maybe GitHub has changed the heuristics they use to determine if a process is abusing their API. To my knowledge, they don't disclose what these heuristics and therefore not much we can do about it. Either @devlead or @gep13 (it's been a long time, so I don't remember who it was) talked to someone at GitHub about it at some point and they acknowledged that they had heuristics in place, above and beyond the publicly disclosed "rate limits", but they have not disclosed what they are.

That's why I added some defense mechanisms in the discoverer such as:

don't delete more than 75 yaml files per scheduled run
don't modify more than 75 yaml files per scheduled run
don't create more than 75 yaml files per scheduled run
delay after processing each yaml file. This delay used to be 1 second, it was randomized (between 600 milliseconds and 2 seconds) last month. Maybe the fact that the delay can now be as low as 600 milliseconds is part of the problem???
check the result of Octokit's .GetLastApiInfo() and stop processing if there are less than an arbitrary but reasonable (as determined by me) number of calls remaining

But evidently, these are not sufficient, so I propose the following:

[ ] don't process more than 75 yaml files TOTAL per scheduled run (as opposed to 75 delete, 75 modify and 75 create). I don't think this will make a big difference because we rarely delete/modify/add more than 75 files at a time but better safe than sorry.
[ ] double the random delay (1200 milliseconds to 4 seconds)
[ ] increase the "minimum number of remaining calls" threshold. Currently 250, maybe increase it to 500???
[ ] display more info in the log when we trigger AbuseException. e.g.: what was the remaining number of API calls prior to the exception, how many yaml file were processed prior to the exception, etc. Maybe this could help us figure out why we are triggering it.

Now that I think about it, I should implement the change to display additional info in the log before changing anything else. Maybe this new information will help me understand the root cause of this problem.

augustoproiete commented 3 years ago

Maybe the fact that the delay can now be as low as 600 milliseconds is part of the problem???

Probably yes, according to GitHub's docs:

"If you're making a large number of POST, PATCH, PUT, or DELETE requests for a single user or client ID, wait at least one second between each request."

So maybe a random wait between 1.5 and 4 seconds would trigger the rate limit less often?

One other thing you might want to consider is to read the Retry-After response when you get the exception, and pause until after the time they say to try again. Maybe Polly policies to retry?

From the docs:

"When you have been limited, use the Retry-After response header to slow down. The value of the Retry-After header will always be an integer, representing the number of seconds you should wait before making requests again. For example, Retry-After: 30 means you should wait 30 seconds before sending more requests".

https://docs.github.com/en/rest/guides/best-practices-for-integrators#dealing-with-abuse-rate-limits

Jericho commented 3 years ago

So maybe a random wait between 1.5 and 4 seconds would trigger the rate limit less often?

<sarcasm>Thanks for suggesting something I already proposed</sarcasm> :stuck_out_tongue_winking_eye:

devlead commented 3 years ago

@patriksvensson correct me if I'm wrong by I think headers returned from GitHub API contains rate limit info, so one might not need to resort to random, but know when to retry.

Edit: saw @augustoproiete already linked to those docs.

patriksvensson commented 3 years ago

@devlead Correct. If we're doing many calls against the same endpoint and resource, we can also cache the ETag between calls. Using an ETag in a GitHub API call counts as 0 (zero) against the rate limit.

Jericho commented 3 years ago

Yes headers contain rate limit info and as I previously indicated I already respect them. Also, there's nothing to cache because we trigger the abuse detection mostly when commiting, creating an issue or submitting a PR. The issue is not retrieving from the API, it's creating, commiting, etc.

I wish the exception from GitHub said something like: You triggered our abuse detection because .... That would tell us exactly how to fix and prevent this problem.

devlead commented 3 years ago

Maybe solution is to run it more often so fewer PRs created at a time? I.e. what would happen if it ran every hour, half hour, etc?

Jericho commented 3 years ago

@devlead that might indeed be a good solution to fall back on if I can't prevent AbuseExceptions: reduce the number of yaml file synchronized per run but run the process more often. GitHub's rate limits are on a "per-hour" basis, therefore we couldn't run our process more often than once an hour. Maybe 4 times a day would be more reasonable?

I still want to figure out why we get this exception and how to prevent it in the first place. It's bothering me!

By the way, one theory I mentioned in my original comment is that there may be another process running under the same cake-contrib-bot identity and consuming GitHub API calls which could interfere with the AddinDiscoverer. Anybody aware of any such process? Maybe a build pipeline, CI, etc.???

devlead commented 3 years ago

Could having a separate addin discover user be an alternative?

cake-contrib / Cake.AddinDiscoverer

Further improve how we minimize the possibility of GitHub AbuseException #188