Kraigie / nostrum

Elixir Discord Library
https://kraigie.github.io/nostrum/
MIT License
598 stars 128 forks source link

Cannot use Nostrum.Api functions in a multi-host configuration due to the request Ratelimiter not running #620

Open jonklein opened 1 month ago

jonklein commented 1 month ago

SCENARIO: Nostrum is running in a multi-node environment, with the Nostrum.Application & consumer processes running on just one node and serving bot requests, as described in the documentation. We'd like the other nodes in the cluster to be able to use Nostrum.Api functions needed to perform web requests, background jobs, etc. Currently, these calls fail because the Ratelimiter instance is not running on the other nodes.

I do have a workaround, but it's probably not the right long-term solution. I'm happy to submit a fix, but want to create this issue to figure out the correct approach.

SOLUTIONS TRIED:

CURRENT WORKAROUND:

I'm running the Nostrum application on multiple nodes as described above, but have forked the library and updated the ConsumerGroup to use :pg.get_local_members instead of :pg.get_members. Even though running the connection on multiple hosts is not the right approach in general, I do feel like get_local_members is more correct here - based on the current multi-node support, I'm not sure what the justification would be for dispatching to consumers on other nodes.

OTHER POSSIBLE SOLUTIONS:

Any thoughts on the preferred approach, especially taking into consideration possible future distributed multi-node consumer support?

Th3-M4jor commented 1 month ago

For the 1.0.0 release, which should be the next non-patch release, we do have plans to make it so that users have to start Nostrum as part of their Supervision tree instead of leveraging the Application path. As part of that we could make it possible to configure how the ratelimiter module is named.

Also have you tried what's suggested in our multi-node guide?

jchristgit commented 1 month ago

Thanks for the extensive bug report.

About the consumer group:

For proper multi-node support here, I believe we would need to distribute the consumers such that you have only a single "primary" consumer for any relevant shard running - so pg:get_members would still work including awaiting events, just it wouldn't route the events to duplicate "normal" consumers. I think that given the complexity of this topic with all the different distribution strategies it might be best to allow users to just hook out of automatic management of consumers and allow them to start their own, documenting it appropriately to showcase how to do this over multiple nodes.

About the ratelimiter:

I thought about this a while ago, and I believe the best way to solve it would be to run a ratelimiter on each node, and then determine the correct ratelimiter to use via erlang:phash2 of the ratelimit bucket.

Currently we have the get_endpoint/2 method in the ratelimiter which is already used to figure out the correct ratelimiter bucket to run. Instead of obtaining that (only) in the ratelimiter itself, the top-level request function should obtain the bucket for a request on its own, figure out which ratelimiters are there in the cluster, and then route it there accordingly.

The alternative, of course, would be to allow the user to submit their own way to handle this. I think that for the standard usecase that you describe this should be sufficient though.

I will try to make a patch for the ratelimiter phash approach described above together with documentation amendments this weekend, I will get in touch with the other maintainers regarding the best approach for the consumers.