Arksine / moonraker

Web API Server for Klipper
https://moonraker.readthedocs.io
GNU General Public License v3.0
1.02k stars 392 forks source link

Loading gcode files causes fluidd to lose connection to moonraker #795

Closed samwiseg0 closed 5 months ago

samwiseg0 commented 5 months ago

What happened

On the latest version it appears there is some timeout that is triggered that causes fluidd to lose communication to moonraker.

This happens when trying to load the list of gcode files. When opening a folder where the gocde files are, the preview image may load on a few gcode files and then the interface hangs and becomes unresponsive.

To test, I rolled back to commit, https://github.com/Arksine/moonraker/commit/6d8cb762ffca1bd151d98e0cec5b66b966965014, using the command git reset --hard 6d8cb762ffca1bd151d98e0cec5b66b966965014 and everything works as expected.

This is happening all 3 of my printers. All of them are on the latest commits.

Client

Fluidd

Browser

Chrome, Safari, Other or N/A

How to reproduce

Run the latest version. I am using authz but I am unsure if that is related to this issue. Navigate to jobs and open a folder containing gcode files and the interface will become unresponsive.

Additional information

moonraker-5.log

Seeing this repeatedly in the logs.

2024-01-21 12:08:31,111 [proc_stats.py:_watchdog_callback()] - EVENT LOOP BLOCKED: 20.31 seconds, total blocked count: 21

Edit: I did a bit more testing. Looks like this maybe Authz specific. Adding my machine as a trusted client and connecting directly to the pi without a reverse proxy. It appears to be working just fine. So this maybe related to the regression fix that was implemented in https://github.com/Arksine/moonraker/commit/a23187b4afa632da8a90cb9eae83f0f6a5eec903.

I believe I found the commit that causes this issue after walking back a few commits at a time. https://github.com/Arksine/moonraker/commit/b3b60757aa0d06eac4c3395ac04a8a8b66e76f60

Arksine commented 5 months ago

Thanks. I'm working on tracking this down but thus far have been unable to reproduce. Can you add the following line to ~/printer_data/systemd/moonraker.env?

MOONRAKER_VERBOSE_LOGGING="y"

Then restart Moonraker, repro the issue, and upload the log. This will create large logs, but will hopefully help me track down where the blocking issue is occurring (and how its related to authorization, it really shouldn't be).

Arksine commented 5 months ago

Update:

Even though I can't reproduce I think I have found the source of the problem. There is a potential blocking call to socket.getfqdn in the authorization module that was generally bypassed previously. Its a call that is very rarely reached and normally it returns immediately. I'll work on a fix. You can hold off on the verbose logging right now, as I am reasonably confident that this is the problem.

samwiseg0 commented 5 months ago

Update:

Even though I can't reproduce I think I have found the source of the problem. There is a potential blocking call to socket.getfqdn in the authorization module that was generally bypassed previously. Its a call that is very rarely reached and normally it returns immediately. I'll work on a fix. You can hold off on the verbose logging right now, as I am reasonably confident that this is the problem.

This makes sense. I discovered late last night that my first DNS server was unresponsive so all queries were timing out on the first server and then heading to the second DNS server.

Arksine commented 5 months ago

This should be resolved in d1f97f2658a4e003da95c131c87ce4c59e86de8f. In addition to running getfqdn() in a thread to prevent blocking, it will now only be called if a domian name is configured in trusted_clients.

samwiseg0 commented 5 months ago

Awesome! I will simulate the DNS failure and test before and after this commit and report back.

Thanks again for looking into this so quickly!

samwiseg0 commented 5 months ago

As best as I could test. I am mostly confident this resolved the issue.

When doing my testing I was not able to have fluidd lose connection entirely but it would freeze for a few seconds loading gcode file images. After your change there was no delay. With or without the DNS server being down.

Arksine commented 5 months ago

Great, thanks. The new changes should avoid DNS lookups entirely given your configuration, which is what we want. It is likely that DNS resolution was the cause of the delay prior, even when the server was up.