Open betterthanever2 opened 5 days ago
Have you taken a look at your system memory usage (and potentially swap usage)? And how much memory do you have installed? We are currently diagnosing excessive memory usage on Linux systems that could cause these issues.
@HPPinata I have 32GB of memory and 8GB of swap. The video card has 12GB. Is there a way to monitor memory usage while running the worker?
I'd just take a look at htop; it should be pretty obvious if it's using enough memory to slow down or crash. btop has a somewhat nicer interface and also shows GPU usage and memory, but what we are tracking down appears to be on the CPU / system RAM side.
I think memory usage is also logged (and the log directory should be mapped to wherever you're running the container from as long as you are using the standard compose configuration).
We've seen issues on 64GB and even 128GB machines (though that's definitely not intended and should hopefully be fixed at some point) so a 32GB system might be even more susceptible.
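For example, a few generic ways to keep an eye on memory from the host while the worker container runs (standard tools, nothing worker-specific):

```sh
# Per-container CPU and memory usage, updated live
docker stats

# Overall system memory and swap, refreshed every 5 seconds
watch -n 5 free -h

# Interactive overview; htop shows CPU/RAM, btop can additionally show GPU usage
htop
```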
UPD: at first I mistakenly wrote `available` memory instead of `free`.
Below is a screenshot of the final state. It started with ~7GB of `used` memory and eventually went up to > 30GB. It could be that the non-fatal errors in the output that occur from time to time coincide with `free` memory dropping below 300MB.
Yep, that looks like the memory issue we are having. There are some more details on the Discord, but as of now all we know is that something is committing, but not really using, a lot of memory. We're not sure what, but the fact that it happens on both AMD and NVIDIA means it's either a PyTorch library, ComfyUI, our usage of their internal methods, or the worker itself. Based on the Windows numbers, 32GB RAM (+ a bit of swap) should be more than enough, but it isn't right now.
You can try setting `threads: 1` and `queue: 0`; that should at least limit growth to one thread.
In the extreme, only offering one model should eliminate the loading/unloading behavior that causes the issue to worsen, but in that regard you'll have to experiment a bit.
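If it helps, here is a minimal sketch of how those suggestions might look in the worker config. The key names and the placeholder model below are assumptions for illustration; the actual bridgeData.yaml may spell them differently (e.g. `max_threads` / `queue_size`), so check your own file.

```yaml
# Illustrative excerpt only; key names and the model entry are assumptions,
# check your actual bridgeData.yaml for the real spelling.
threads: 1       # a single inference thread limits memory growth to one job at a time
queue: 0         # don't pre-fetch additional jobs into a local queue
models_to_load:  # offering just one model avoids the constant load/unload cycle
  - "some_single_model"
```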
Ok, from what you're saying I can conclude that the issue is known, the team is working on it, and a fix will be provided at some point.
I'll leave it up to you whether to keep this issue open (but it could be a good way of informing interested users about a future fix).
It's related to #316, but I'll have to get around to updating it to reflect the current extent of the issue (and our knowledge so far). I think at least until then it can stay open, if only as a reminder to merge all the Linux memory/OOM stuff into one issue.
I have successfully set up the worker on my home machine. The GPU is a GeForce RTX 4070 and the OS is Ubuntu. I run the worker in a Docker container.
When I start the app, all goes fine for maybe 10 minutes. Jobs are picked up, kudos are calculated, all the jazz. Then I notice one or several jobs fail, which looks something like this in the logs:
(this may or may not be relevant to the main issue).
These errors, however, are not fatal, and the app goes on. That continues until (I'm assuming) memory leaks and the machine just freezes. Here are the last lines in my Docker logs:
So, the exception does not register in the logs, and I'm not sure how this can be fixed. I tried reducing params in the config, specifically: `max_power` to 16 (initially it was 32) and `unload_models_from_vram_often` to `true`. Any recommendations on how to diagnose this?
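For reference, a rough sketch of the relevant lines in my config after those changes (excerpt only; the surrounding structure is assumed rather than copied from the actual bridgeData.yaml):

```yaml
# Config excerpt after the changes described above (sketch, not the full file)
max_power: 16                        # reduced from the initial 32
unload_models_from_vram_often: true  # enabled so models are unloaded from VRAM more often
```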