Kraigie / nostrum

Elixir Discord Library
https://kraigie.github.io/nostrum/
MIT License

Cannot use Nostrum.Api functions in a multi-host configuration due to the request Ratelimiter not running #620

Open jonklein opened 3 months ago

jonklein commented 3 months ago

SCENARIO: Nostrum is running in a multi-node environment, with the Nostrum.Application & consumer processes running on just one node and serving bot requests, as described in the documentation. We'd like the other nodes in the cluster to be able to use Nostrum.Api functions needed to perform web requests, background jobs, etc. Currently, these calls fail because the Ratelimiter instance is not running on the other nodes.

I do have a workaround, but it's probably not the right long-term solution. I'm happy to submit a fix, but want to create this issue to figure out the correct approach.

SOLUTIONS TRIED:

CURRENT WORKAROUND:

I'm running the Nostrum application on multiple nodes as described above, but have forked the library and updated the ConsumerGroup to use :pg.get_local_members instead of :pg.get_members. Even though running the connection on multiple hosts is not the right approach in general, I do feel like get_local_members is more correct here - based on the current multi-node support, I'm not sure what the justification would be for dispatching to consumers on other nodes.
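For reference, here is a minimal sketch of the difference between the two :pg calls. The scope name :demo_scope and the group name :consumers are made up for illustration and are not the names Nostrum uses. On a single, non-distributed node both calls return the same list. In a cluster, get_members also returns pids that joined on other nodes, while get_local_members returns only pids on the calling node.

```elixir
# Start a private :pg scope (hypothetical name, not Nostrum's).
{:ok, _scope_pid} = :pg.start_link(:demo_scope)

# Join the current process to a group in that scope.
:ok = :pg.join(:demo_scope, :consumers, self())

# All members known to this node, including pids on remote nodes.
all_members = :pg.get_members(:demo_scope, :consumers)

# Only members running on the calling node.
local_members = :pg.get_local_members(:demo_scope, :consumers)

IO.inspect(all_members, label: "get_members")
IO.inspect(local_members, label: "get_local_members")
```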

OTHER POSSIBLE SOLUTIONS:

Any thoughts on the preferred approach, especially taking into consideration possible future distributed multi-node consumer support?

Th3-M4jor commented 3 months ago

For the 1.0.0 release, which should be the next non-patch release, we do have plans to make it so that users have to start Nostrum as part of their Supervision tree instead of leveraging the Application path. As part of that we could make it possible to configure how the ratelimiter module is named.

Also, have you tried what's suggested in our multi-node guide?

jchristgit commented 3 months ago

Thanks for the extensive bug report.

About the consumer group:

For proper multi-node support here, I believe we would need to distribute the consumers so that only a single "primary" consumer runs for any relevant shard. pg:get_members would then still work, including for awaiting events; it just wouldn't route events to duplicate "normal" consumers. Given the complexity of this topic and all the different distribution strategies, it might be best to let users opt out of automatic consumer management and start their own consumers, with documentation showing how to do this across multiple nodes.

About the ratelimiter:

I thought about this a while ago, and I believe the best way to solve it would be to run a ratelimiter on each node, and then determine the correct ratelimiter to use via erlang:phash2 of the ratelimit bucket.

Currently we have the get_endpoint/2 function in the ratelimiter, which is already used to figure out the correct ratelimiter bucket for a request. Instead of obtaining that (only) in the ratelimiter itself, the top-level request function should obtain the bucket for a request on its own, figure out which ratelimiters are running in the cluster, and then route the request accordingly.
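A rough sketch of that routing idea, assuming the list of running ratelimiters is already known. The module BucketRouter, its function, and the node names are hypothetical and not part of Nostrum; only :erlang.phash2/2 is a real call. Sorting the list first is a design choice made here so that every node maps a given bucket to the same ratelimiter regardless of the order in which it discovered them.

```elixir
defmodule BucketRouter do
  @moduledoc """
  Hypothetical helper, not part of Nostrum: picks one ratelimiter
  out of a list based on a hash of the bucket name.
  """

  # With no ratelimiters there is nothing to route to.
  def limiter_for_bucket(_bucket, []), do: {:error, :no_ratelimiters}

  # `limiters` stands in for whatever the cluster reports as running
  # ratelimiters (pids, node names, ...).
  def limiter_for_bucket(bucket, limiters) when is_list(limiters) do
    sorted = Enum.sort(limiters)
    # :erlang.phash2/2 returns an integer in 0..range-1.
    index = :erlang.phash2(bucket, length(sorted))
    {:ok, Enum.at(sorted, index)}
  end
end

# Made-up node names standing in for ratelimiter instances.
limiters = [:"bot@host-a", :"bot@host-b", :"bot@host-c"]
bucket = "/channels/123/messages"

{:ok, first} = BucketRouter.limiter_for_bucket(bucket, limiters)
{:ok, second} = BucketRouter.limiter_for_bucket(bucket, Enum.reverse(limiters))

IO.inspect(first == second, label: "same limiter regardless of list order")
```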

The alternative, of course, would be to allow the user to submit their own way to handle this. I think the approach above should be sufficient for the standard use case that you describe, though.

I will try to make a patch for the ratelimiter phash approach described above, together with documentation amendments, this weekend. I will also get in touch with the other maintainers regarding the best approach for the consumers.

eliasdarruda commented 2 weeks ago

Any news here? I'm also running into some trouble with this scenario.

My current scenario is:

Node A - Has both Nostrum.Consumer and Nostrum.Application in the supervision tree
Node B - Has Nostrum.Application in the supervision tree

With the nostrum dependency pointed at GitHub, I still receive duplicate events when running a Consumer on only one node and Nostrum.Application on both nodes.

Sending messages in both nodes seems to work properly.

If I start only Node B, it fails to identify that there is a Consumer running.

12:24:45.308 [error] No consumers were running nor did any start up in time for shard session startup. Is a consumer started as part of your supervision tree?

12:24:45.311 [error] ** State machine <0.370.0> terminating
** Last event = {info,{gun_upgrade,<0.373.0>,
                                   #Ref<0.3478671056.3312189447.111908>,
                                   [<<"websocket">>],
                                   [{<<"date">>,
                                     <<"Wed, 13 Nov 2024 15:24:40 GMT">>},
                                    {<<"connection">>,<<"upgrade">>},
                                    {<<"sec-websocket-accept">>,
                                     <<"vvPOee46ohPTrGkhzyBRL80JI3s=">>},
                                    {<<"upgrade">>,<<"websocket">>},
                                    {<<"cf-cache-status">>,<<"DYNAMIC">>},
                                    {<<"report-to">>,
                                     <<"{\"endpoints\":[{\"url\":\"https:\\/\\/a.nel.cloudflare.com\\/report\\/v4?s=3Mtyua9I1CpPq1UwK2uIDWqx3sFHsyjGHVdc4tNtCyVWa7pFmvPNMtYFMNTplmBjbjD8zsdV%2F5G%2Bxo577sSDj%2Ft3AyfIm3oLAXSDvFanT%2F%2BZPQXh3U8kAbzaZMgiF4wJ1mfrEA%3D%3D\"}],\"group\":\"cf-nel\",\"max_age\":604800}">>},
                                    {<<"nel">>,
                                     <<"{\"success_fraction\":0,\"report_to\":\"cf-nel\",\"max_age\":604800}">>},
                                    {<<"strict-transport-security">>,
                                     <<"max-age=31536000; includeSubDomains; preload">>},
                                    {<<"x-content-type-options">>,
                                     <<"nosniff">>},
                                    {<<"server">>,<<"cloudflare">>},
                                    {<<"cf-ray">>,
                                     <<"8e1fc51ee85f6258-GRU">>}]}}
** When server state  = {connecting_ws,
                            #{session => nil,
                              stream => #Ref<0.3478671056.3312189447.111908>,
                              seq => nil,
                              '__struct__' => 'Elixir.Nostrum.Struct.WSState',
                              gateway => <<"gateway.discord.gg">>,
                              conn => <0.373.0>,resume_gateway => nil,
                              shard_num => 0,total_shards => 1,
                              last_heartbeat_ack => nil,
                              last_heartbeat_send => nil,
                              heartbeat_interval => nil,heartbeat_ack => nil,
                              compress_ctx => nil,conn_pid => <0.370.0>}}
** Reason for termination = error:{badmatch,timeout}
** Callback modules = ['Elixir.Nostrum.Shard.Session']
** Callback mode = [state_functions,state_enter]
** Stacktrace =
**  [{'Elixir.Nostrum.Shard.Session',connected,3,
                                     [{file,"lib/nostrum/shard/session.ex"},
                                      {line,330}]},
     {gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,3735}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,329}]}]
** Time-outs: {1,[{state_timeout,upgrade_timeout}]}

If I start Node A after Node B, it spins up a consumer and works as expected.

However, if I start Node A first and Node B after that, Node B identifies the consumer from Node A, and I receive duplicated events as if Node B had the consumer in its own supervision tree (which is not the case).


Another thing I tried:

If you run Nostrum.Application on only a single node (which doesn't seem to be intended anyway), sending messages from the other node fails with the error below:

** (ArgumentError) errors were found at the given arguments:

* 2nd argument: out of range

  :erlang.phash2("/channels/<my_channel_id>/messages", 0)
  (nostrum 0.10.0) lib/nostrum/api/ratelimiter_group.ex:44: Nostrum.Api.RatelimiterGroup.limiter_for_bucket/1
  (nostrum 0.10.0) lib/nostrum/api/ratelimiter.ex:982: Nostrum.Api.Ratelimiter.queue/1
  (nostrum 0.10.0) lib/nostrum/api.ex:289: Nostrum.Api.create_message/2
  (nostrum 0.10.0) lib/nostrum/api.ex:299: Nostrum.Api.create_message!/2
  iex:1: (file)
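The 0 in the failing call is consistent with the range being derived from the number of ratelimiters visible to the calling node, which would be zero when Nostrum.Application is not running there. That reading is an assumption based on the stack trace, not something confirmed against the Nostrum source. The phash2 behavior itself can be reproduced in isolation (the channel ID here is a placeholder):

```elixir
# The range argument of :erlang.phash2/2 must be a positive integer,
# so a range of 0 raises ArgumentError.
result =
  try do
    :erlang.phash2("/channels/123/messages", 0)
  rescue
    e in ArgumentError -> {:raised, e.__struct__}
  end

IO.inspect(result, label: "range 0")

# With a range of 1 the call succeeds; the only possible result is 0.
index = :erlang.phash2("/channels/123/messages", 1)
IO.inspect(index, label: "range 1")
```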

Expected scenario:

I start Node A and Node B concurrently and only one consumer instance is identified.

eliasdarruda commented 1 week ago

ping @jchristgit