Kraigie / nostrum

Elixir Discord Library
https://kraigie.github.io/nostrum/
MIT License

Ratelimiter state machine errors #517

Closed tspenov closed 1 year ago

tspenov commented 1 year ago

I got these Ratelimiter errors in the logs after upgrading to the latest master. Edit: these seem to show up frequently in the logs.

Date: 2023-06-29T10:54:24.916
Level: error

Task #PID<0.6921.0> started from #PID<0.5666.0> terminating
** (stop) exited in: :gen_statem.call(Nostrum.Api.Ratelimiter, {:queue, %{body: "", headers: [{"content-type", "application/json"}], method: :get, params: [], route: "/users/@me"}}, :infinity)
    ** (EXIT) an exception was raised:
        ** (FunctionClauseError) no function clause matching in Nostrum.Api.Ratelimiter.connected/3
            (nostrum 0.9.0-alpha3) lib/nostrum/api/ratelimiter.ex:397: Nostrum.Api.Ratelimiter.connected(:info, {:gun_error, #PID<0.5654.0>, #Reference<0.331392565.3752329217.62190>, :closed}, %{conn: #PID<0.5654.0>, inflight: %{}, outstanding: %{"/users/@me" => {:initial, {[], []}}}, remaining_in_window: 48, running: %{#Reference<0.331392565.3752329217.62190> => {"/users/@me", %{body: "", headers: [{"content-type", "application/json"}], method: :get, params: [], route: "/users/@me"}, {#PID<0.6921.0>, #Reference<0.331392565.3752329217.62189>}}}})
            (stdlib 4.3) gen_statem.erl:1426: :gen_statem.loop_state_callback/11
            (stdlib 4.3) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
    (stdlib 4.3) gen.erl:243: :gen.do_call/4
    (stdlib 4.3) gen_statem.erl:900: :gen_statem.call_dirty/4
    (nostrum 0.9.0-alpha3) lib/nostrum/api.ex:2602: Nostrum.Api.get_current_user/0
    (sanbase 0.0.1) lib/sanbase/discord/discord_consumer.ex:256: Sanbase.DiscordConsumer.warm_up/1
    (sanbase 0.0.1) lib/sanbase/discord/discord_consumer.ex:70: Sanbase.DiscordConsumer.handle_event/1
    (elixir 1.14.4) lib/task/supervised.ex:89: Task.Supervised.invoke_mfa/2
    (stdlib 4.3) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
Function: #Function<1.121426740/0 in Sanbase.DiscordConsumer.handle_info/2>
    Args: []
--

Nostrum version: "nostrum": {:git, "https://github.com/Kraigie/nostrum.git", "49a2944d7e1a953a4cafae18b97e9379da36f3bb", []}, tagged version: nostrum 0.9.0-alpha3

Elixir and OTP versions:

Erlang/OTP 25 [erts-13.2] [source] [64-bit] [smp:8:1] [ds:8:1:10] [async-threads:1] [jit:ns]

Interactive Elixir (1.14.4) - press Ctrl+C to exit (type h() ENTER for help)

Traceback

** State machine 'Elixir.Nostrum.Api.Ratelimiter' terminating
** Last event = {info,{gun_error,<0.5654.0>,
                                 #Ref<0.331392565.3752329217.62190>,closed}}
** When server state  = {connected,#{conn => <0.5654.0>,inflight => #{},
                                     outstanding =>
                                         #{<<"/users/@me">> =>
                                               {initial,{[],[]}}},
                                     remaining_in_window => 48,
                                     running =>
                                         #{#Ref<0.331392565.3752329217.62190> =>
                                               {<<"/users/@me">>,
                                                #{body => <<>>,
                                                  headers =>
                                                      [{<<"content-type">>,
                                                        <<"application/json">>}],
                                                  method => get,params => [],
                                                  route => <<"/users/@me">>},
                                                {<0.6921.0>,
                                                 #Ref<0.331392565.3752329217.62189>}}}}}
** Reason for termination = error:function_clause
** Callback modules = ['Elixir.Nostrum.Api.Ratelimiter']
** Callback mode = state_functions
** Stacktrace =
**  [{'Elixir.Nostrum.Api.Ratelimiter',connected,
         [info,
          {gun_error,<0.5654.0>,#Ref<0.331392565.3752329217.62190>,closed},
          #{conn => <0.5654.0>,inflight => #{},
            outstanding => #{<<"/users/@me">> => {initial,{[],[]}}},
            remaining_in_window => 48,
            running =>
                #{#Ref<0.331392565.3752329217.62190> =>
                      {<<"/users/@me">>,
                       #{body => <<>>,
                         headers =>
                             [{<<"content-type">>,<<"application/json">>}],
                         method => get,params => [],route => <<"/users/@me">>},
                       {<0.6921.0>,#Ref<0.331392565.3752329217.62189>}}}}],
         [{file,"lib/nostrum/api/ratelimiter.ex"},{line,397}]},
     {gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1426}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]
** Time-outs: {1,[{{timeout,reset_bot_calls_window},expired}]}
jchristgit commented 1 year ago

Please retry with the latest master. I haven't seen this error myself before nor could I reproduce it, but it should deal with automatically requeueing requests that were closed abruptly.

tspenov commented 1 year ago

> Please retry with the latest master. I haven't seen this error myself before nor could I reproduce it, but it should deal with automatically requeueing requests that were closed abruptly.

Thanks for the quick reply! I'll test it out and let you know

tspenov commented 1 year ago

> Please retry with the latest master. I haven't seen this error myself before nor could I reproduce it, but it should deal with automatically requeueing requests that were closed abruptly.

@jchristgit I tested it, and now I see this error in the logs:

Date: 2023-06-29T14:03:02.570
Level: error

State machine 'Elixir.Nostrum.Api.Ratelimiter' terminating

Last event = {
  info, {
    gun_down, <0.5653.0>, http2, closed, [#Ref<0.2297931708.4051173377.228586>]
  }
}

Server state = {
  connected, #{
    conn => <0.5653.0>,
    inflight => #{},
    outstanding => #{
      <<"/users/@me">> => {
        initial, {
          [{
            #{body => <<>>,
              headers => [{<<"content-type">>, <<"application/json">>}],
              method => get,
              params => [],
              route => <<"/users/@me">>
            }, 
            {<0.6601.0>, #Ref<0.2297931708.4051173377.228585>}
          }], []
        }
      }
    },
    remaining_in_window => 48,
    running => #{}
  }
}

Reason for termination = error: {badkey, #Ref<0.2297931708.4051173377.228586>}

Callback modules = ['Elixir.Nostrum.Api.Ratelimiter']

Callback mode = state_functions

Stacktrace = [
  {erlang, map_get, [#Ref<0.2297931708.4051173377.228586>, #{}], [{error_info, #{module => erl_erts_errors}}]},
  {'Elixir.Nostrum.Api.Ratelimiter', '-connected/3-fun-1-', 3, [{file, "lib/nostrum/api/ratelimiter.ex"}, {line, 881}]},
  {'Elixir.Enum', '-map/2-lists^map/1-0-', 2, [{file, "lib/enum.ex"}, {line, 1658}]},
  {'Elixir.Nostrum.Api.Ratelimiter', connected, 3, [{file, "lib/nostrum/api/ratelimiter.ex"}, {line, 878}]},
  {gen_statem, loop_state_callback, 11, [{file, "gen_statem.erl"}, {line, 1426}]},
  {proc_lib, init_p_do_apply, 3, [{file, "proc_lib.erl"}, {line, 240}]}
]

Time-outs: {1, [{{timeout, reset_bot_calls_window}, expired}]}
jchristgit commented 1 year ago

I pushed another commit to master, with the same caveats: I cannot really test it.

Are you on an unreliable network? Or do you have another clue why these may keep appearing?

tspenov commented 1 year ago

> I pushed another commit to master, with the same caveats: I cannot really test it.
>
> Are you on an unreliable network? Or do you have another clue why these may keep appearing?

@jchristgit Thank you! I'll test it out tomorrow and let you know.

The Elixir app is deployed to AWS, so there shouldn't be any problem with the network. I'm not sure why the WS connection to Discord gets closed so often. Another thing: the app very often gets {:stream_error, :closed} from gun, in which case I just retry the command/interaction.
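
Roughly what that retry looks like (a simplified sketch; the helper name and the exact {:error, {:stream_error, :closed}} shape of the returned error are illustrative, not copied from the real code):

defmodule RetryHelper do
  # Illustrative helper: retry a Nostrum.Api call a few times when gun
  # drops the HTTP/2 stream mid-request.
  def with_stream_retry(fun, retries \\ 3) do
    case fun.() do
      # Assumed error shape for a dropped gun stream.
      {:error, {:stream_error, :closed}} when retries > 0 ->
        Process.sleep(500)
        with_stream_retry(fun, retries - 1)

      other ->
        other
    end
  end
end

# Usage:
# RetryHelper.with_stream_retry(fn ->
#   Nostrum.Api.create_interaction_response(interaction, response)
# end)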

tspenov commented 1 year ago

> I pushed another commit to master, with the same caveats: I cannot really test it.
>
> Are you on an unreliable network? Or do you have another clue why these may keep appearing?

@jchristgit Sometimes after Nostrum.Api.get_current_user() is executed, the logs show Request to "/users/@me" was closed abnormally, requeueing, but the call never returns and never raises an exception. It seems to block indefinitely.

To work around this I have wrapped it in a Task like this:

  defp warm_up(msg_or_interaction, retries \\ 3) do
    t1 = System.monotonic_time(:millisecond)
    log(msg_or_interaction, "WARM UP STARTING...")

    task = Task.async(fn -> Nostrum.Api.get_current_user() end)

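    # Wait up to 5 s for the API call; if it is still hanging, shut the task down (yields nil).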
    case Task.yield(task, 5000) || Task.shutdown(task) do
      nil ->
        log(msg_or_interaction, "WARM UP ERROR: Timeout reached.", type: :error)

        if retries > 0 do
          log(msg_or_interaction, "WARM UP TIMEOUT: Retrying...", type: :error)

          warm_up(msg_or_interaction, retries - 1)
        else
          log(msg_or_interaction, "WARM UP TIMEOUT: No more retries.", type: :error)
        end

      {:ok, result} ->
        # handle the result
        handle_result(result, msg_or_interaction, retries)
    end

    t2 = System.monotonic_time(:millisecond)
    log(msg_or_interaction, "Time spent warming up #{t2 - t1}ms.")
  end

Here are some log excerpts:

{"level":"info","message":"[id=1124357812546457661] WARM UP STARTING... msg.content=<@1039814526708764742> hi metadata=%
{"level":"debug","message":"Accounting for request with 50 remaining user calls (initial)","timestamp":"2023-06-30T15:16:32.178"}
{"level":"debug","message":"Accounting for request with 49 remaining user calls","timestamp":"2023-06-30T15:16:32.178"}
{"level":"warning","message":"Request to \"/users/@me\" was closed abnormally, requeueing","timestamp":"2023-06-30T15:16:32.183"}
{"level":"info","message":"[id=1124357812546457661] WARM UP ERROR: Timeout reached. msg.content=<@1039814526708764742> {"level":"info","message":"[id=1124357812546457661] WARM UP TIMEOUT: Retrying...\"852836083381174282\"}","timestamp":"2023-06-30T15:16:37.178"}
{"level":"info","message":"[id=1124357812546457661] WARM UP STARTING... \"852836083381174282\"}","timestamp":"2023-06-30T15:16:37.178"}
{"level":"debug","message":"Accounting for request with 50 remaining user calls (initial)","timestamp":"2023-06-30T15:16:37.202"}
{"level":"debug","message":"Accounting for request with 49 remaining user calls","timestamp":"2023-06-30T15:16:37.202"}
{"level":"info","message":"[id=1124357812546457661] WARM UP SUCCESS ….
{"level":"info","message":"[id=1124357812546457661] Time spent warming up 190ms.
{"level":"info","message":"[id=1124357812546457661] Time spent warming up 5191ms …
jchristgit commented 1 year ago

I cannot reproduce this here, not with spamming the endpoint nor with running it one by one like you do.

Do you have a proxy sitting in front of nostrum? Some outgoing firewall?

tspenov commented 1 year ago

> I cannot reproduce this here, not with spamming the endpoint nor with running it one by one like you do. Do you have a proxy sitting in front of nostrum? Some outgoing firewall?

@jchristgit The Elixir app is deployed in a Kubernetes cluster in AWS. The problem occurs only on the production cluster; locally and on the stage cluster it is fine. On production I can reproduce it every time once the WS connection has been idle for 5-10 minutes (i.e. no other API call to Discord has been made). The only difference between the stage and prod clusters is that the prod cluster routes outgoing requests through an AWS NAT Gateway. The default timeout for TCP connections on the AWS NAT Gateway is 350 seconds (~6 minutes). But I wonder why this happens, since there are HEARTBEAT_ACKs in the logs roughly every 40 seconds; they should be refreshing the connection, so this AWS timeout should not kick in. Do you have any insight on this?

jchristgit commented 1 year ago

Sorry but I have no environment where I can reproduce this, at all.

Since you mentioned that the only difference between stage and prod is the AWS NAT Gateway, I cannot help but assume that the NAT Gateway is at fault and that nostrum works properly here. Perhaps the NAT Gateway is messing around with HTTP/2 connections, or it is doing some other strange form of introspection on the request. In nostrum, the bot gateway request does not do anything out of the ordinary in terms of HTTP requests.

tspenov commented 1 year ago

@jchristgit My current workaround is a job that makes an API call every couple of minutes, which makes sure the connection never sits idle.
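
Roughly, that job is just a GenServer pinging Discord on a timer (a minimal sketch; the module name, the 3-minute interval, and the choice of Nostrum.Api.get_current_user/0 as the ping are illustrative):

defmodule DiscordKeepalive do
  use GenServer

  # Ping comfortably under the ~350 s NAT Gateway idle timeout.
  @interval :timer.minutes(3)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    schedule()
    {:ok, nil}
  end

  @impl true
  def handle_info(:ping, state) do
    # Any cheap authenticated call works; it keeps the ratelimiter's
    # HTTP/2 connection to Discord from sitting idle.
    Nostrum.Api.get_current_user()
    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :ping, @interval)
end

Started under the app's supervision tree, it keeps one cheap request flowing well inside the ~350-second NAT Gateway idle timeout.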

Thank you for the quick responses and the fixes, really appreciated! 💙

I am closing the issue.