Haidra-Org / horde-worker-reGen

The default client software to create images for the AI-Horde
https://aihorde.net/
GNU Affero General Public License v3.0

Memory issues with AMD cards #316

Open HPPinata opened 1 month ago

HPPinata commented 1 month ago

Somewhere within the ROCm stack, a library, ComfyUI, or the reGen worker there is a severe memory leak.

It seems to be triggered by loading and unloading models, not by the actual compute. When multiple models are offered, a few more GB of system RAM get used after (almost) every swap of which models are active (in VRAM) or preloaded (in RAM).

I first noticed this after no-vram was merged (https://github.com/Haidra-Org/horde-worker-reGen/compare/v8.1.2...v9.0.2). The behavior changed from a relatively static ~17GB per queue/thread (also quite a lot, increasing over time before leveling off) to gradually hogging more and more RAM over time (as much as 40GB!! for just one thread). If a worker thread got killed and restarted its usage was reset, but the worker wasn't always able to recover.

VRAM usage, on the other hand, had gotten a lot better, going from 15-20+ GB (even on SDXL and SD1.5) to 5-15 GB (depending on model), with the only 20 GB cases being FLUX.1 (which would be somewhat expected).

Depending on loaded models, job mix, etc. the worker (even with 64GB of RAM) becomes unusable after 15-45 min. Offering only 1 model seems to fix this (or at least help). The impact of LoRa and controlnet is still unclear, but just disabling them doesn't magically fix things.

A clarification on expected behavior (when and how is memory supposed to be used) would be helpful. Is the worker supposed to keep the active model in system RAM, even though it has already been transferred to VRAM? Is the memory usage of a thread supposed to go down when switching from a heavier model (SDXL/FLUX) to a lighter one (SD1.5)?

I'll keep doing more testing, and will also do some comparisons to the ComfyUI behavior once I have time, but that might take a while (I'm not familiar with Comfy yet, let alone with how to debug it).

System:

HPPinata commented 1 month ago

I think I can rule out the kernel and AMD driver, since it happened with a lot of different versions, and even in a WSL environment with the paravirtualized Windows driver. The behavior also seems consistent across different ROCm versions, though in that regard I still have a few combinations to examine. I also took a look at open process file handles; there was some accumulation of deleted /tmp entries over time, but only a couple MB worth, nowhere near the tens of GB unaccounted for.

I'm not a developer, so as soon as things get into the Python/PyTorch realm I'm out of my comfort zone, and the best I can do is try random stuff and see what happens (most of the time nothing, because I broke something I didn't understand in the first place). Ideas on how to debug things more effectively are very welcome.
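The only angle I know how to take from the outside is watching the RSS of the worker's process tree to see which child keeps growing. A minimal sketch of that (plain psutil, nothing worker-specific; the PID argument is the main worker process):

```python
# Watch per-process RSS of the worker and its children to spot which one
# keeps growing. Assumes psutil is installed; run as: python watch_rss.py <pid>
import sys
import time

import psutil


def watch(pid: int, interval: float = 10.0) -> None:
    root = psutil.Process(pid)
    while True:
        for p in [root] + root.children(recursive=True):
            try:
                rss_mib = p.memory_info().rss / (1024**2)
                print(f"{time.strftime('%H:%M:%S')} pid={p.pid} rss={rss_mib:,.0f} MiB")
            except psutil.NoSuchProcess:
                continue  # child exited between listing and sampling
        time.sleep(interval)


if __name__ == "__main__":
    watch(int(sys.argv[1]))
```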

tazlin commented 1 month ago

A clarification on expected behavior (when and how is memory supposed to be used) would be helpful. Is the worker supposed to keep the active model in system RAM, even though it has already been transferred to VRAM? Is the memory usage of a thread supposed to go down when switching from a heavier model (SDXL/FLUX) to a lighter one (SD1.5)?

The current implementation as of the raw-png branch can be expected to use up all available system RAM for some period of time. I generally encourage worker operators to only run the worker and no other memory/GPU-intensive applications beyond a browser. The matter of what counts as "excessive memory usage" is up for debate, I suppose, but changing that behavior at this point would constitute a feature request out of scope for your bug report.

However, if the general trend of memory usage grows arbitrarily over time and never levels off, or if the floor of memory usage continually rises, that's very likely a bug. If I had to guess, the root of the problem for ROCm might be here: https://github.com/Haidra-Org/hordelib/blob/main/hordelib/comfy_horde.py#L433:L436

_comfy_soft_empty_cache is an aliased call to the following: https://github.com/comfyanonymous/ComfyUI/blob/e5ecdfdd2dd980262086b0df17cfde0b1d505dbc/comfy/model_management.py#L1077

The cryptic comment there by that developer suggests I'm probably abusing this function, though the reason is left unclear; it only notes that it "makes things worse for ROCm". You could modify the file directly in the site-packages dir of your venv to set this to false manually in horde-engine and see if that changes anything for you. In the short-to-medium term, especially if you are able to validate that this changes the dynamics of the problem, I could have the worker send the appropriate value when it is started with a ROCm card. However, if I had to guess, the problem is more complicated than a simple flag flip.
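To make that concrete, here is a rough, self-contained sketch of what the linked ComfyUI function does and what the suggested change amounts to. The function and helper names below are mine, and the logic is paraphrased from the linked code rather than copied, so treat it as an approximation:

```python
# Approximation of comfy.model_management.soft_empty_cache (aliased as
# _comfy_soft_empty_cache in hordelib/comfy_horde.py). Not the actual
# ComfyUI source; see the links above for the real code.
import torch


def running_on_rocm() -> bool:
    # PyTorch ROCm builds expose torch.version.hip; CUDA builds leave it None.
    return torch.version.hip is not None


def soft_empty_cache_sketch(force: bool = False) -> None:
    if torch.cuda.is_available():
        # Upstream's comment says emptying the cache "makes things worse" on
        # ROCm, so it only does this for NVIDIA unless force is set.
        if force or not running_on_rocm():
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()


# The experiment described above: make sure the worker's calls into this
# function never force the cache flush on a ROCm card, i.e. effectively:
soft_empty_cache_sketch(force=False)
```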

I would just like to take a moment to emphasize that the worker's use of ComfyUI (via horde-engine, formerly hordelib) is not officially supported by the Comfy team in any way. Specifically, ComfyUI makes assumptions about the state of the machine based on sunny-day memory/VRAM conditions and does not anticipate any other high-memory-usage applications. In the worker use case, however, we spawn N ComfyUI instances (i.e., N high-memory-usage applications), which shatters many of these built-in expectations. It's therefore very much a guessing game for me as well: I have to support the huge number of permutations of system configurations our users have while still trying to understand the massive codebase that ComfyUI is, which is constantly changing and often makes changes that are fundamentally contrary to the worker use case. This is of course not their fault, but it is quite difficult hitting moving targets.

Historically, we were only ever able to support CUDA cards, and so for that and other reasons that I am sure are becoming obvious to you, I will readily admit that support for ROCm is lacking.

The truth of the matter is that AMD support has relied entirely on volunteers running under conditions of their choosing, sending me logs, and me attempting to optimize based on that alone. I have had very little hands-on time with an AMD/ROCm card to nail down these issues, and only a few willing volunteers. If you are willing and able, we could have a more interactive conversation in the official AI-Horde Discord server, found here: https://discord.gg/r7yE6Nu8. This conversation would be appropriate for the #local-workers channel, where you should feel free to ping me (@tazlin on Discord).

In any event, I can see you've clearly put sincere thought and energy into your recent flurry of activity, and I appreciate the work you've put in so far. Feel free to continue here if needed, or reach out to me on Discord as mentioned above.

tazlin commented 1 month ago

And just as an aside, I would encourage you to ensure you have a reasonable amount of swap configured on your system, as it has been shown to defray some of the memory-related issues at times. I suspect it wouldn't be a perfect silver bullet, but if you had little or none configured, I would at least try adding some.

HPPinata commented 1 month ago

And just as an aside, I would encourage you to ensure you have a reasonable amount of swap configured on your system, as it has been shown to defray some of the memory-related issues at times. I suspect it wouldn't be a perfect silver bullet, but if you had little or none configured, I would at least try adding some.

I have a 1:1 ratio of memory to swap, and I've seen 15+ GB of it being used (with some actual disk activity, so it's not just sitting there).

The cryptic comment there by that developer suggests I'm probably abusing this function, though the reason is left unclear; it only notes that it "makes things worse for ROCm". You could modify the file directly in the site-packages dir of your venv to set this to false manually in horde-engine and see if that changes anything for you. In the short-to-medium term, especially if you are able to validate that this changes the dynamics of the problem, I could have the worker send the appropriate value when it is started with a ROCm card. However, if I had to guess, the problem is more complicated than a simple flag flip.

I tested around a bit; nothing new so far. But knowing where the load/unload is happening already helps narrow down what I'm searching for.

Something interesting I found is that the process apparently thinks it has VAST amounts of virtual system memory available (or the units on that field are completely different from the VRAM ones):

2024-10-07 10:30:14.041 | DEBUG    | horde_worker_regen.process_management.process_manager:on_memory_report:380 - Process 1 memory report: ram: 27614076928 vram: 13665 total vram: 24560
2024-10-07 10:30:14.042 | DEBUG    | horde_worker_regen.process_management.process_manager:on_memory_report:380 - Process 1 memory report: ram: 27614076928 vram: 13665 total vram: 24560
2024-10-07 10:30:14.146 | DEBUG    | horde_worker_regen.process_management.process_manager:on_memory_report:380 - Process 0 memory report: ram: 1274597376 vram: None total vram: None
2024-10-07 10:30:18.081 | DEBUG    | horde_worker_regen.process_management.process_manager:on_memory_report:380 - Process 2 memory report: ram: 26040344576 vram: 6297 total vram: 24560
2024-10-07 10:30:18.082 | DEBUG    | horde_worker_regen.process_management.process_manager:on_memory_report:380 - Process 2 memory report: ram: 26040344576 vram: 6297 total vram: 24560
2024-10-07 10:30:18.201 | DEBUG    | horde_worker_regen.process_management.process_manager:on_memory_report:380 - Process 0 memory report: ram: 1278730240 vram: None total vram: None
2024-10-07 10:30:18.201 | DEBUG    | horde_worker_regen.process_management.process_manager:on_memory_report:380 - Process 2 memory report: ram: 26040213504 vram: 6331 total vram: 24560
2024-10-07 10:30:18.202 | DEBUG    | horde_worker_regen.process_management.process_manager:on_memory_report:380 - Process 2 memory report: ram: 26040213504 vram: 6331 total vram: 24560

I'll hop on the Discord later; I just wanted to open this issue here first (as a reference), since working with those long-form texts over there is a bit tedious (my handle on Discord is @Momi_V).

HPPinata commented 1 month ago

I've done some more testing: running just one thread (queue_size: 0), the memory usage appears to stabilize between 40GB and 45GB. AMD GO FAST (flash_attn) appears to have some interaction; without it, usage is closer to 20GB. I'll report back with more data. The CPU memory values appear to be in bytes rather than MB, but are otherwise consistent with btop.
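As a quick sanity check on those units, here is the Process 1 entry from the log above run through the conversion (plain arithmetic, values copied from the log; the vram fields look like MiB):

```python
# Unit check for the memory report above: "ram" only lines up with btop if it
# is bytes, while "vram" / "total vram" appear to be MiB.
ram_bytes = 27_614_076_928        # "ram" for Process 1 in the log above
print(f"ram: {ram_bytes / 1024**2:,.0f} MiB ({ram_bytes / 1024**3:.1f} GiB)")
print("vram: 13665 MiB used / 24560 MiB total")  # copied directly from the log
```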