Cannot use Nostrum.Api functions in a multi-host configuration due to the request Ratelimiter not running

jonklein commented 3 months ago

SCENARIO: Nostrum is running in a multi-node environment, with the Nostrum.Application & consumer processes running on just one node and serving bot requests, as described in the documentation. We'd like the other nodes in the cluster to be able to use Nostrum.Api functions needed to perform web requests, background jobs, etc. Currently, these calls fail because the Ratelimiter instance is not running on the other nodes.

I do have a workaround, but it's probably not the right long-term solution. I'm happy to submit a fix, but want to create this issue to figure out the correct approach.

SOLUTIONS TRIED:

Start just the Ratelimiter on the non-main nodes: this fails when a failover occurs – the Nostrum application fails to restart on the new host due to the Ratelimiter already running.
Run the Nostrum application on multiple nodes, but run the main consumer on only a single node: this almost works, but results in duplicate events, because every node's connection dispatches to the consumer, which is globally registered with :pg in ConsumerGroup.

CURRENT WORKAROUND:

I'm running the Nostrum application on multiple nodes as described above, but have forked the library and updated the ConsumerGroup to use :pg.get_local_members instead of :pg.get_members. Even though running the connection on multiple hosts is not the right approach in general, I do feel like get_local_members is more correct here - based on the current multi-node support, I'm not sure what the justification would be for dispatching to consumers on other nodes.
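
For readers unfamiliar with the two `:pg` calls, here is a minimal sketch of the difference. The scope and group names are invented for illustration and are not the ones Nostrum's ConsumerGroup uses.

```elixir
# Illustrative only: :demo_scope and :demo_consumers are made-up names,
# not the scope/group that Nostrum's ConsumerGroup registers.
scope = :demo_scope
group = :demo_consumers

{:ok, _pg} = :pg.start_link(scope)

# Join the calling process to the group, roughly as a consumer would on startup.
:ok = :pg.join(scope, group, self())

# Members on every connected node that runs this scope.
all_members = :pg.get_members(scope, group)

# Members that live on the current node only.
local_members = :pg.get_local_members(scope, group)

IO.inspect(all_members, label: "cluster-wide")
IO.inspect(local_members, label: "local only")
```

On a single node both lists are identical. In a cluster, `get_members` would also return pids joined on other nodes, which is where the duplicate dispatch described above comes from.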

OTHER POSSIBLE SOLUTIONS:

Make it possible to run Nostrum.Api.Ratelimiter independent of Nostrum.Application: if Nostrum.Application allowed an option to not start the Ratelimiter, applications could optionally start it themselves separately on every node.
Register the Ratelimiter as a global process (some tradeoffs are discussed in the Ratelimiter docs)

Any thoughts on the preferred approach, especially taking into consideration possible future distributed multi-node consumer support?

Th3-M4jor commented 3 months ago

For the 1.0.0 release, which should be the next non-patch release, we do have plans to make it so that users have to start Nostrum as part of their Supervision tree instead of leveraging the Application path. As part of that we could make it possible to configure how the ratelimiter module is named.

Also have you tried what's suggested in our multi-node guide?

jchristgit commented 3 months ago

Thanks for the extensive bug report.

About the consumer group:

For proper multi-node support here, I believe we would need to distribute the consumers such that you have only a single "primary" consumer for any relevant shard running - so pg:get_members would still work including awaiting events, just it wouldn't route the events to duplicate "normal" consumers. I think that given the complexity of this topic with all the different distribution strategies it might be best to allow users to just hook out of automatic management of consumers and allow them to start their own, documenting it appropriately to showcase how to do this over multiple nodes.

About the ratelimiter:

I thought about this a while ago, and I believe the best way to solve it would be to run a ratelimiter on each node, and then determine the correct ratelimiter to use via erlang:phash2 of the ratelimit bucket.

Currently we have the get_endpoint/2 method in the ratelimiter which is already used to figure out the correct ratelimiter bucket to run. Instead of obtaining that (only) in the ratelimiter itself, the top-level request function should obtain the bucket for a request on its own, figure out which ratelimiters are there in the cluster, and then route it there accordingly.
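
A rough sketch of the routing idea, not the actual patch: `BucketRouting` and the limiter names are hypothetical, and it assumes every node sees the same ordered, non-empty list of ratelimiters.

```elixir
defmodule BucketRouting do
  # Hypothetical helper, not part of Nostrum: picks one limiter out of a
  # non-empty list based on the hash of the bucket name.
  def limiter_for(bucket, [_ | _] = limiters) do
    # :erlang.phash2/2 returns an integer in 0..(range - 1).
    index = :erlang.phash2(bucket, length(limiters))
    Enum.at(limiters, index)
  end
end

# Placeholder atoms standing in for ratelimiter processes across the cluster.
limiters = [:limiter_a, :limiter_b, :limiter_c]
bucket = "/channels/123/messages"

first = BucketRouting.limiter_for(bucket, limiters)
second = BucketRouting.limiter_for(bucket, limiters)

# The same bucket maps to the same limiter as long as the list is unchanged.
IO.inspect({first, second})
```

Because the hash is deterministic, requests for one bucket would always land on the same ratelimiter, provided all nodes agree on the list and its order.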

The alternative, of course, would be to allow the user to submit their own way to handle this. I think that for the standard usecase that you describe this should be sufficient though.

I will try to make a patch for the ratelimiter phash approach described above, together with documentation amendments, this weekend. I will also get in touch with the other maintainers regarding the best approach for the consumers.

eliasdarruda commented 2 weeks ago

Any news here? I'm also running into trouble with this scenario.

My current scenario is:

Node A - Has both Nostrum.Consumer and Nostrum.Application in the supervision tree
Node B - Has Nostrum.Application in the supervision tree

When pointing the nostrum dependency at GitHub, I still receive duplicate events when running a Consumer on only one node and the Nostrum.Application on both nodes.

Sending messages from both nodes seems to work properly.

If I start only Node B, it fails to identify that there is a Consumer running.

12:24:45.308 [error] No consumers were running nor did any start up in time for shard session startup. Is a consumer started as part of your supervision tree?

12:24:45.311 [error] ** State machine <0.370.0> terminating
** Last event = {info,{gun_upgrade,<0.373.0>,
                                   #Ref<0.3478671056.3312189447.111908>,
                                   [<<"websocket">>],
                                   [{<<"date">>,
                                     <<"Wed, 13 Nov 2024 15:24:40 GMT">>},
                                    {<<"connection">>,<<"upgrade">>},
                                    {<<"sec-websocket-accept">>,
                                     <<"vvPOee46ohPTrGkhzyBRL80JI3s=">>},
                                    {<<"upgrade">>,<<"websocket">>},
                                    {<<"cf-cache-status">>,<<"DYNAMIC">>},
                                    {<<"report-to">>,
                                     <<"{\"endpoints\":[{\"url\":\"https:\\/\\/a.nel.cloudflare.com\\/report\\/v4?s=3Mtyua9I1CpPq1UwK2uIDWqx3sFHsyjGHVdc4tNtCyVWa7pFmvPNMtYFMNTplmBjbjD8zsdV%2F5G%2Bxo577sSDj%2Ft3AyfIm3oLAXSDvFanT%2F%2BZPQXh3U8kAbzaZMgiF4wJ1mfrEA%3D%3D\"}],\"group\":\"cf-nel\",\"max_age\":604800}">>},
                                    {<<"nel">>,
                                     <<"{\"success_fraction\":0,\"report_to\":\"cf-nel\",\"max_age\":604800}">>},
                                    {<<"strict-transport-security">>,
                                     <<"max-age=31536000; includeSubDomains; preload">>},
                                    {<<"x-content-type-options">>,
                                     <<"nosniff">>},
                                    {<<"server">>,<<"cloudflare">>},
                                    {<<"cf-ray">>,
                                     <<"8e1fc51ee85f6258-GRU">>}]}}
** When server state  = {connecting_ws,
                            #{session => nil,
                              stream => #Ref<0.3478671056.3312189447.111908>,
                              seq => nil,
                              '__struct__' => 'Elixir.Nostrum.Struct.WSState',
                              gateway => <<"gateway.discord.gg">>,
                              conn => <0.373.0>,resume_gateway => nil,
                              shard_num => 0,total_shards => 1,
                              last_heartbeat_ack => nil,
                              last_heartbeat_send => nil,
                              heartbeat_interval => nil,heartbeat_ack => nil,
                              compress_ctx => nil,conn_pid => <0.370.0>}}
** Reason for termination = error:{badmatch,timeout}
** Callback modules = ['Elixir.Nostrum.Shard.Session']
** Callback mode = [state_functions,state_enter]
** Stacktrace =
**  [{'Elixir.Nostrum.Shard.Session',connected,3,
                                     [{file,"lib/nostrum/shard/session.ex"},
                                      {line,330}]},
     {gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,3735}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,329}]}]
** Time-outs: {1,[{state_timeout,upgrade_timeout}]}

If I start Node A after Node B, it spins up a consumer and works as expected.

However, if I start Node A first and then Node B, Node B identifies the consumer from Node A, and I receive duplicated events as if Node B had the consumer in its own supervision tree (which is not the case).

Another thing I tried:

If you run the Nostrum.Application on only a single node (which doesn't seem to be intended anyway), sending messages from the other node fails with the error below:

** (ArgumentError) errors were found at the given arguments:

* 2nd argument: out of range

  :erlang.phash2("/channels/<my_channel_id>/messages", 0)
  (nostrum 0.10.0) lib/nostrum/api/ratelimiter_group.ex:44: Nostrum.Api.RatelimiterGroup.limiter_for_bucket/1
  (nostrum 0.10.0) lib/nostrum/api/ratelimiter.ex:982: Nostrum.Api.Ratelimiter.queue/1
  (nostrum 0.10.0) lib/nostrum/api.ex:289: Nostrum.Api.create_message/2
  (nostrum 0.10.0) lib/nostrum/api.ex:299: Nostrum.Api.create_message!/2
  iex:1: (file)
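
The trace is consistent with `:erlang.phash2/2` being called with a range of 0, which presumably means no ratelimiters were found for this node (an inference from the trace, not something confirmed in the library). A small sketch that reproduces the error and shows a hypothetical guard; `SafeRouting` is not Nostrum code:

```elixir
defmodule SafeRouting do
  # Hypothetical wrapper, not Nostrum's implementation: returns an error
  # tuple instead of hashing into an empty list.
  def limiter_for(_bucket, []), do: {:error, :no_ratelimiters}

  def limiter_for(bucket, limiters) when is_list(limiters) do
    {:ok, Enum.at(limiters, :erlang.phash2(bucket, length(limiters)))}
  end
end

bucket = "/channels/123/messages"
range = 0

# phash2 requires a range of at least 1, so a range of 0 raises.
raised? =
  try do
    :erlang.phash2(bucket, range)
    false
  rescue
    ArgumentError -> true
  end

empty_result = SafeRouting.limiter_for(bucket, [])

IO.inspect(raised?, label: "phash2 with range 0 raised")
IO.inspect(empty_result, label: "guarded lookup")
```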

Expected scenario:

I start Node A and Node B concurrently and only one consumer instance is identified.

eliasdarruda commented 1 week ago

ping @jchristgit

Kraigie / nostrum, issue #620