Kile / Killua

Source code for the discord bot Killua
https://killua.dev
GNU General Public License v3.0

ZeroMQ IPC fails after a while #607

Open Kile opened 5 months ago

Kile commented 5 months ago

This has been an issue for a while. In a development environment, ZeroMQ works perfectly fine, but not long after a restart of the production code, ZeroMQ requests start failing silently. This breaks vote rewards as well as all GET endpoints used by the website, rendering it almost completely useless. Several fixes have been attempted, but none have worked so far.

This issue occurs in these lines of code:

Server: https://github.com/Kile/Killua/blob/7bf697e3c4f1cf75f62d274cda3b537e0d5a3fd8/killua/cogs/api.py#L27-L57

Client: https://github.com/Kile/Killua/blob/7bf697e3c4f1cf75f62d274cda3b537e0d5a3fd8/killua/webhook/api.py#L30-L50

I suspected this was caused by too many open connections, but I am not sure that is the case, since I seem to close all connections. This is the output of an lsof command when this issue occurred in production:

Because this has been a long-running issue and because it is quite important for functionality, I am turning this into an issue to keep track of the progress.

I have also asked this Stack Overflow question in hopes of a fix.

Kile commented 5 months ago

This seems to be an issue with the API, not ZeroMQ. I can still make ZeroMQ requests internally, but the API fails. I remember it failing from time to time before I created the website; with the large number of additional requests, it seems to happen much faster. I am just not sure why. I will continue investigating.


Kile commented 5 months ago

A few days ago I changed hypercorn to use 8 workers instead of 1, and this seems to have helped. The API has been running without issues for multiple days now.

Kile commented 5 months ago

Sadly, this issue is not resolved. It is definitely a hypercorn issue: increasing the number of workers only delays the point at which the API starts timing out. I am looking into solutions.

Kile commented 1 month ago

This may now be resolved. While rewriting this API in Rust, I believe I found the root cause of this issue with the help of @y21.

The root cause was that ZeroMQ, for some reason, in its default behaviour prevents pointers from being dropped at the end of a function. So when my make_request function ends, with everything up to that point having worked as expected, it tries to drop the variables but is continuously prevented from doing so. This means no error is raised, but the code freezes at a low level, which is insanely hard to trace.

Turns out this is default zmq behaviour, but thankfully there is a method to change it. So a simple one-line change fixes this:

socket.set_linger(0)

That's it. That is what I have tried to find for 8 months. Hopefully this actually fixes it. I will keep this issue open for a bit; if I close it, that was it.

Kile commented 1 month ago

Looking through the Python implementation, this is a bit harder to see because the linger argument is passed to the underlying C implementation.
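
For anyone hitting the same thing from Python, here is a minimal sketch of how the linger setting could be applied with pyzmq. This is not the code from killua/webhook/api.py: the `make_request` signature, the payload shape and the timeout are assumptions made for illustration. It sets `zmq.LINGER` to 0 so closing the socket discards any unsent message instead of waiting for delivery, and it polls with a timeout so a missing reply returns `None` rather than blocking.

```python
from typing import Optional

import zmq


def make_request(address: str, payload: dict, timeout_ms: int = 1000) -> Optional[dict]:
    """Send one JSON request over a REQ socket.

    Hypothetical helper: returns the decoded reply, or None if no reply
    arrives within timeout_ms.
    """
    context = zmq.Context.instance()
    socket = context.socket(zmq.REQ)
    # Discard unsent messages on close instead of waiting for them to be delivered.
    socket.setsockopt(zmq.LINGER, 0)
    try:
        socket.connect(address)
        socket.send_json(payload)
        # Wait a bounded amount of time for the reply rather than blocking forever.
        if socket.poll(timeout_ms, zmq.POLLIN):
            return socket.recv_json()
        return None
    finally:
        # Always close the socket, whether the request succeeded or timed out.
        socket.close()
```

A call would look like `make_request("tcp://127.0.0.1:5555", {"route": "ping"})`, where both the address and the payload are placeholders. A fresh REQ socket per request is used here because a REQ socket that never received its reply cannot send again.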