Haidra-Org / horde-worker-reGen

The default client software to create images for the AI-Horde
https://aihorde.net/
GNU Affero General Public License v3.0

Machine hangs after performing a few jobs #350

Open betterthanever2 opened 5 days ago

betterthanever2 commented 5 days ago

I have successfully set up the worker on my home machine. The GPU is a GeForce RTX 4070 and the OS is Ubuntu. I run the worker in a Docker container.

When I start the app, all goes fine for maybe 10 minutes. Jobs are picked up, kudos are calculated, all the jazz. Then I notice one or several jobs fail, which looks something like this in the logs:

reGen  | 2024-11-16 16:49:05.000 | INFO     | [HWRPM]:api_job_pop:3542 - Popped job 77325d6d-170c-49cd-b09f-d268104ad037 (5 eMPS) (model: ICBINP - I Can't Believe It's Not Photography)
reGen  | 2024-11-16 16:49:05.009 | INFO     | [HWRPM]:api_job_pop:3576 - Job queue: <c46cab9e-ad16-4c2f-82b4-3b2be3e73792: AlbedoBase XL (SDXL)>, <77325d6d-170c-49cd-b09f-d268104ad037: ICBINP - I Can't Believe It's Not Photography>
reGen  | 2024-11-16 16:49:09.461 | INFO     | [HWRPM]:receive_and_handle_process_messages:1796 - Inference finished for job c46cab9e-ad16-4c2f-82b4-3b2be3e73792 on process 2. It took 11.74 seconds and reported 0 faults.
reGen  | 2024-11-16 16:49:09.461 | ERROR    | horde_worker_regen.process_management.process_manager:receive_and_handle_process_messages:1811 - Job c46cab9e-ad16-4c2f-82b4-3b2be3e73792 faulted on process 2: Inference result
reGen  | 2024-11-16 16:49:09.464 | INFO     | [HWRPM]:receive_and_handle_process_messages:1752 - Process 2 unloaded model AlbedoBase XL (SDXL)
reGen  | 2024-11-16 16:49:09.465 | ERROR    | horde_worker_regen.process_management.process_manager:api_submit_job:2754 - Job [c46cab9e-ad16-4c2f-82b4-3b2be3e73792] has no state, assuming faulted
reGen  | 2024-11-16 16:49:09.465 | ERROR    | horde_worker_regen.process_management.process_manager:api_submit_job:2758 - Job [c46cab9e-ad16-4c2f-82b4-3b2be3e73792] faulted, removing from completed jobs after submitting the faults to the horde
reGen  | 2024-11-16 16:49:09.465 | ERROR    | horde_worker_regen.process_management.process_manager:submit_single_generation:2544 - Job c46cab9e-ad16-4c2f-82b4-3b2be3e73792 has no image result

(this may or may not be relevant to the main issue).

These errors, however, are not fatal, and the app keeps going until (I'm assuming) memory leaks build up and the machine just freezes. Here are the last lines in my Docker logs:

reGen  | 2024-11-16 16:51:42.410 | INFO     | [HWRPM]:api_job_pop:3542 - Popped job b904b10b-4926-41cd-9c13-fb99da4aadea (7 eMPS) (model: AlbedoBase XL (SDXL))
reGen  | 2024-11-16 16:51:42.410 | INFO     | [HWRPM]:api_job_pop:3576 - Job queue: <b904b10b-4926-41cd-9c13-fb99da4aadea: AlbedoBase XL (SDXL)>
reGen  | 2024-11-16 16:51:42.437 | INFO     | [HWRPM]:start_inference:2199 - Starting inference for job b904b10b-4926-41cd-9c13-fb99da4aadea on process 2
reGen  | 2024-11-16 16:51:42.438 | INFO     | [HWRPM]:start_inference:2204 - Model: AlbedoBase XL (SDXL)
reGen  | 2024-11-16 16:51:42.438 | INFO     | [HWRPM]:start_inference:2236 - 512x512 for 30 steps with sampler k_euler for a batch of 1
reGen  | 2024-11-16 16:51:42.539 | INFO     | [HWRPM]:print_status_method:4061 - Process info:
reGen  | 2024-11-16 16:51:42.539 | INFO     | [HWRPM]:print_status_method:4063 - Process 0: (SAFETY) WAITING_FOR_JOB 
reGen  | 2024-11-16 16:51:42.539 | INFO     | [HWRPM]:print_status_method:4063 - Process 1 (WAITING_FOR_JOB):  (ICBINP - I Can't Believe It's Not Photography [last event: 2.03 secs ago: START_INFERENCE]
reGen  | 2024-11-16 16:51:42.539 | INFO     | [HWRPM]:print_status_method:4063 - Process 2 (INFERENCE_STARTING):  (AlbedoBase XL (SDXL) [last event: 0.0 secs ago: START_INFERENCE]
reGen  | 2024-11-16 16:51:42.539 | INFO     | [HWRPM]:print_status_method:4066 - dreamer_name: DreemingWorker | (v9.2.1) | horde user: shoomow#327573 | num_models: 2 | max_power: 16 (724x724) | max_threads: 1 | queue_size: 1 | safety_on_gpu: True
reGen  | 2024-11-16 16:51:42.540 | INFO     | [HWRPM]:print_status_method:4121 - Jobs: <b904b10b-4926-41cd-9c13-fb99da4aadea: AlbedoBase XL (SDXL)>
reGen  | 2024-11-16 16:51:42.540 | INFO     | [HWRPM]:print_status_method:4129 - Active models: {'AlbedoBase XL (SDXL)', "ICBINP - I Can't Believe It's Not Photography"}
reGen  | 2024-11-16 16:51:42.540 | SUCCESS  | [HWRPM]:print_status_method:4145 - Session job info: currently popped: 1 (eMPS: 7) | submitted: 60 | faulted: 0 | slow_jobs: 0 | process_recoveries: 0 | 160.36 seconds without jobs
reGen  | 2024-11-16 16:51:42.684 | INFO     | [HWRPM]:api_job_pop:3528 - No job available. Current number of popped jobs: 1. (Skipped reasons: {'bridge_version': 0, 'models': 231, 'nsfw': 0, 'performance': 0, 'untrusted': 116, 'worker_id': 5, 'max_pixels': 221, 'lora': 100})
reGen  | 2024-11-16 16:51:44.834 | INFO     | [HWRPM]:api_job_pop:3528 - No job available. Current number of popped jobs: 1. (Skipped reasons: {'bridge_version': 0, 'models': 232, 'nsfw': 0, 'performance': 0, 'untrusted': 115, 'worker_id': 5, 'max_pixels': 222, 'lora': 100})
reGen  | 2024-11-16 16:51:44.989 | SUCCESS  | [HWRPM]:log_kudos_info:3694 - Total Session Kudos: 1,975.02 over 14.17 minutes | Session: 8,364.08 (extrapolated) kudos/hr
reGen  | 2024-11-16 16:51:44.989 | INFO     | [HWRPM]:log_kudos_info:3698 - Total Kudos Accumulated: 2,621.00 (all workers for shoomow#327573)
reGen  | 2024-11-16 16:51:45.118 | INFO     | [HWRPM]:api_job_pop:3528 - No job available. Current number of popped jobs: 1. (Skipped reasons: {'bridge_version': 0, 'models': 230, 'nsfw': 0, 'performance': 0, 'untrusted': 115, 'worker_id': 5, 'max_pixels': 221, 'lora': 99})
reGen  | 2024-11-16 16:51:46.336 | INFO     | [HWRPM]:api_job_pop:3528 - No job available. Current number of popped jobs: 1. (Skipped reasons: {'bridge_version': 0, 'models': 228, 'nsfw': 0, 'performance': 0, 'untrusted': 115, 'worker_id': 5, 'max_pixels': 220, 'lora': 99})
reGen  | 2024-11-16 16:51:47.753 | INFO     | [HWRPM]:api_job_pop:3528 - No job available. Current number of popped jobs: 1. (Skipped reasons: {'bridge_version': 0, 'models': 228, 'nsfw': 0, 'performance': 0, 'untrusted': 116, 'worker_id': 5, 'max_pixels': 218, 'lora': 101})

So the exception does not register in the logs, and I'm not sure how this can be fixed. I tried reducing params in the config, specifically:

Any recommendations on how to diagnose this?

HPPinata commented 3 days ago

Have you taken a look at your system memory usage (and potentially swap usage)? And how much memory do you have installed? We are currently diagnosing excessive memory usage on Linux systems that could cause these issues.

betterthanever2 commented 3 days ago

@HPPinata I have 32 GB of memory and 8 GB of swap. The video card has 12 GB. Is there a way to monitor memory usage while running the worker?

HPPinata commented 3 days ago

I'd just take a look at htop; it should be pretty obvious if it's using enough memory to slow down or crash. btop has a somewhat nicer interface and also shows GPU usage and memory, but what we are tracking down appears to be on the CPU / system RAM side.

I think memory usage is also logged (and the log directory should be mapped to wherever you're running the container from as long as you are using the standard compose configuration).
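
For reference, that log mapping usually looks something like the snippet below in the compose file (the service name and container path here are placeholders, not necessarily what the shipped compose file uses):

```yaml
# Hypothetical excerpt -- service name and container path are placeholders;
# check the repo's docker-compose file for the real values.
services:
  regen:
    volumes:
      - ./logs:/app/logs   # host ./logs receives the worker's log files
```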

We've seen issues on 64GB and even 128GB machines (though that's definitely not intended and should hopefully be fixed at some point), so a 32GB system might be even more susceptible.

betterthanever2 commented 3 days ago

UPD: at first I mistakenly put available memory instead of free.

Below is a screenshot of the final state. It started with ~7 GB of used memory and eventually went up to >30 GB. It could be that the non-fatal errors in the output, which occur from time to time, coincide with free memory dropping below 300 MB.

[screenshot: photo_2024-11-18_21-05-28]

HPPinata commented 3 days ago

Yep, that looks like the memory issue we are having. There are some more details on the Discord, but as of now all we know is that something is committing, but not really using, a lot of memory. We're not really sure what, but the fact that it happens on both AMD and NVIDIA means it's either a PyTorch library, ComfyUI, our usage of their internal methods, or the worker itself. Based on the Windows numbers, 32GB of RAM (+ a bit of swap) should be more than enough, but right now it isn't.

You can try setting threads: 1 and queue: 0; that should at least limit growth to one thread.
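
In bridgeData.yaml that would look roughly like this (a minimal sketch; the key names max_threads and queue_size are assumed from the status log above, so double-check them against the config template shipped with the worker):

```yaml
# Concurrency limits -- key names assumed from the worker's status output
# (max_threads / queue_size); verify against your bridgeData.yaml template.
max_threads: 1   # run a single inference thread
queue_size: 0    # don't keep an extra popped job waiting while one is running
```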

In the extreme, offering only one model should eliminate the loading/unloading behavior that causes the issue to worsen, but in that regard you'll have to experiment a bit.
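
For the single-model route, something along these lines (the models_to_load key name is assumed from the standard config template, and the model entry is just one of the models already in your logs):

```yaml
# Offering a single model avoids the load/unload churn that makes the growth worse.
# Key name (models_to_load) and the entry below are illustrative, not prescriptive.
models_to_load:
  - "AlbedoBase XL (SDXL)"
```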

betterthanever2 commented 3 days ago

OK, from what you're saying, I can conclude that the issue is known, the team is working on it, and a fix will be provided at some point.

I'll leave it up to you whether to keep this issue open (but it could be a good way of informing interested users about a future fix).

HPPinata commented 3 days ago

It's related to #316, but I'll have to get around to updating it to reflect the current extent of the issue (and our knowledge so far). I think at least until then it can stay open, if only as a reminder to merge all the Linux memory/OOM stuff into one.