Open recursiveGecko opened 4 months ago
Here are results of my production trace (Rexbug.start(["Nostrum.Api :: return", "Nostrum.Api.Ratelimiter :: return"])
) with every function call and return value logged, so it should be possible to follow through the internal state of the state machine.
nostrum-bug-trace-redacted.log
Line 3166 is where the send_message
function call gets stuck. A lot of the information has been redacted, but hopefully in a way that doesn't impede understanding.
I'm using Elixir 1.14 with OTP 26. I hope this helps!
First of all, sorry for the long delay in getting back to you. I've read this multiple times now but unfortunately I'm still unsure about the actual issue.
The original issue seems mostly clear to me, and I believe there might be a bug
in there. Specifically the only way I see how that can happen is if the
ratelimiter would not go into :disconnected
state before the second request
is sent, and the request is then somehow silently dropped in the ratelimiter. I
can't really see why so far, so I looked into the log you attached.
This is where things get a bit confusing to me. You say line 3166 is where it
gets stuck - do you mean the call in IEx or the outstanding call to channel
aaaaaaaa
? Because for the outstanding request to create a message in channel
aaaaaa
, it just always sits at :initial
, which means nostrum is still
waiting for a reply, and for a long time judging by the length of your bug
trace. I see two possibilities for that to happen:
gun
never sends us a reply: this shouldn't happen, but we should
probably add a reasonably long timeout to outbound requests to exclude the
possibility. There is also a ticket upstream at ninenines/gun#334, which I
wonder if it's related.
There is a bug in the ratelimiter for handling of :initial
requests:
Certainly possible, but I don't see how so far, since it's a special atom
value we match against and we do integer arithmetic to deal with things
which would crash...
But what confuses me is that the issue in your log and the issue you originally opened seemed to be two separate problems - or am I misunderstanding it?
Either way, I don't think it will solve this problem, I think we should maybe introduce some reasonably long timeouts on all outbound requests, and maybe set up a debug monitor for them somehow. Such that we can say, "this thing was sent but never finished". @jb3 would be happy for your thoughts on this as well.
By the way, thanks for the super useful bug trace. FWIW, if you need to debug
this again, the ratelimiter also has :sys.trace
support, which might yield
some other information.
Hi, I've run into an odd issue where
Nostrum.Api.create_message
call never returns.Here's a relevant log message from the library itself that happens just before this issue occurs:
Here's a message from my application, immediately following it. This indicates that this particular
Api
call returned:But then the following
Nostrum.Api.create_message
call from the same process seems to hang in this state indefinitely:...While
Ratelimiter
doesn't seem to be doing any work:...but
Nostrum.Api.create_message
calls from an attached console continue to work just fine:For context,
F1Bot.Output.Discord
is a long-running process that receives messages from PubSub and publishes them as Discord messages, but it ends up blocked whenNostrum.Api
doesn't return.I'm using Nostrum from commit
4fabfc5bf59878fdde118acd686f6a5e075b5f8e
due to #522. Happy to provide more info!