Denhac / denhac-webhooks

This repo holds some of our membership automation stuff. It listens to webhooks from the main wordpress site
7 stars 2 forks source link

Requests to slack conversations.list are exceeding a rate limit #46

Closed chungl closed 4 months ago

chungl commented 5 months ago

This issue was first noticed following (and appears connected to) my recent update enabling batch submissions of equipment authorizations (#38).

We're seeing rate limit errors (429 Too Many Requests) from slack on the https://denhac.slack.com/api/conversations.list endpoint initiated by the AddToChannel job. These errors come in rapid succession shortly following a submission of equipment authorizations.

It appears that when denhac.org receives the batch of userMemberships, it sends us a batch of userMembership.created messages, each of which triggers a call to SlackReactor::onUserMembershipCreated, which in turn calls AddToChannel::queue()->execute($customerId, $slackId). execute calls SlackActionTrait::channelIdFromChannel, which calls ConversationsApi::toSlackIds, which calls ConversationsApi::list, which throws the error.

Both occurrences of this problem have been for woodshop training. I'm not sure what slack channel it's trying to get (#help-woodworking?), but there's four tools being authorized (Table saw, jointer, planer, router table). One or more of them must have user_slack_id or trainer_slack_id. No trainers have been authorized in either of these batches, so it should be a user_slack_id.

We could likely eliminate the error by caching the results of channelIdFromChannel, as it's read many times and unlikely to change - especially in short succession.

However, it's a bit surprising that we need to call channelIdFromChannel at all, given that the input to the AddToChannel job is named _slack_id. Perhaps it's an inconsistent naming convention, or perhaps we're doing unnecessary and/or nonsensical work in this job.

Jnesselr commented 5 months ago

You have the flow more or less correct. The only difference is quibbling, but just for training purposes, userMembership.created creates a UserMembershipCreated event through ProcessWebhookJob by calling through the MemberhsipAggregate.

All of those have a user_slack_id and all point to #help-woodworking. It's a _slack_id being passed in but it can also be the slack name and it converts automatically through that method call. Channel ids, start with a C almost always, but supposedly can also start with a D. They're not super public about it and have changed id formats before so I just decided to rely on the api call and make sure to always get the right channel name no matter what gets passed in.

Slack's rate limiting has tiers and is per method for our case.

How I would solve this (and I probably will end up solving this one):

  1. Add a SlackRateLimited class with static methods for various API methods we might use a lot as well as per tier. The per tier ones would share their space and if we end up using one a lot, congrats it gets its own.
  2. Each static method ensures that the right RateLimiter is defined. We currently do that in the AppServiceProvider but if we're doing a bunch of them, I'd rather make them dynamic. Note that the one I linked is for the specific operation, not for the api methods. I'd refactor that.
  3. In AddToChannel, I'd use middleware for the rate limiting, similar for the one used here. I would have a rate limit for conversations list, conversations invite, and conversations join.

The negative to this approach is you do have to make sure when you're adding a new api method, you also include those methods in the rate limit. I say "have to", but really we'd just see failed jobs and realize "oops we missed one".

The benefits are that this is highly composable, meaning we can have a lot of things going on all at the same time and if we wouldn't be able to call just one of those methods, it's fine it'll just re-queue. If another job didn't need one of the methods that was currently used up, it would be able to execute just fine.