erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator function, model finetuning, custom models, and WAV file maintenance. It can also be used with third-party software via JSON calls.

Major Memory Management Issue #117

Closed: altoiddealer closed this 6 months ago

altoiddealer commented 6 months ago

diagnostics.log

Hello, I'm suddenly having some big problems with memory when using SD-Webui-Forge at the same time as AllTalk.

The most important thing to note is that I am not even using textgenwebui to invoke the image generations; I am simply running both processes, but generating through the SD Forge UI.

Until today, images would generate within seconds.

Today, IF I have alltalk_tts enabled, it adds several minutes to each image.

When I close textgenwebui, it goes right back to taking only a few seconds!

If I launch textgenwebui without alltalk_tts, it still only takes a few seconds!

Launch with alltalk enabled... several minutes.

It is extremely baffling to me.

Desktop (please complete the following information):

- AllTalk was updated: Today, 3/11/2024
- Custom Python environment: No
- Text-generation-webUI was updated: Today, 3/11/2024

altoiddealer commented 6 months ago

Never mind, there's just something really wonky happening and it is very sporadic.

Apparently it's not alltalk_tts causing this.

erew123 commented 6 months ago

Hi

I know you've closed this, and I'll be honest, it's probably a very complicated one to investigate for any of the devs involved: text-gen-webui, Forge, and more specifically me personally.

I know text-gen-webui moved up to PyTorch 2.2.1 last week, and that has new memory management and "improvements". Forge does some clever things with memory management and processing on your GPU, and I have no idea how well its memory management may interplay with other things, or how it handles potentially swapping things in and out of VRAM (Forge seems to be doing some extra special calls with memory management).
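If you want to see what PyTorch's allocator is actually holding at any point, you can dump its statistics from inside the environment in question. This is just a generic diagnostic sketch (it assumes a CUDA-enabled PyTorch build), nothing AllTalk- or Forge-specific:

```python
# Generic diagnostic sketch (not AllTalk/Forge code): show how much CUDA
# memory this process has allocated vs. reserved by the caching allocator.
# Assumes a CUDA-enabled PyTorch build.
import torch

if torch.cuda.is_available():
    dev = torch.cuda.current_device()
    allocated_mib = torch.cuda.memory_allocated(dev) / 1024**2  # live tensors
    reserved_mib = torch.cuda.memory_reserved(dev) / 1024**2    # held by allocator
    print(f"allocated: {allocated_mib:.0f} MiB, reserved: {reserved_mib:.0f} MiB")
    print(torch.cuda.memory_summary(dev, abbreviated=True))
```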

If you are continuing to have issues, I'd be tempted to politely ask the Forge dev how well they think the memory management and advanced processing methods may work alongside other things such as text-gen-webui (or other AI software). I suggest speaking to them, as they clearly have a very good grasp on CUDA, Python, and certain performance techniques, and they are more likely to have an inkling as to what you may want to try first.

I'd certainly give them details of any of the advanced command-line features you use within Forge and see what they think about possible interplay.

In theory, everything should work pretty happily together as long as each bit of software is making one call at a time and you have enough RAM/VRAM to move things around smoothly.

As it goes, I tested AllTalk today with normal Stable Diffusion and text-gen-webui (all fresh installs) and didn't experience any issues over 30 minutes of testing. That's not me pointing the finger at Forge as the issue. It could be many things, even just your system. I'm just saying I tested it today and didn't encounter a problem.

Good luck though.

altoiddealer commented 6 months ago

Thank you for your helpful reply - what happened is that I did repeatedly test turning things on and off, and alltalk_tts was the variable consistently causing the problem.

...until, one time while it was disabled, an image generated at normal speed. I only had a minute to rush back and close the case before the office closed.

Your response did make me think of something, though: I may have slipped up with the nVidia CUDA memory fallback setting after having reinstalled textgen. This will be the first thing I check... after that, I may need to roll back some commits on Forge and see what happens.

In any case, I've got a bit of tinkering to do to root out the true cause, and I'll report back when I do to satiate any curiosity :)

erew123 commented 6 months ago

I can make one other recommendation for you: enable the Low VRAM mode in AllTalk. If it's NOT enabled, the TTS model will want to stay in your GPU's VRAM. If it's enabled, AllTalk will move the model between system RAM and VRAM as/when necessary (check the built-in documentation for a more technical explanation).
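For anyone curious, the general pattern a low VRAM mode like this follows in PyTorch looks roughly like the sketch below. To be clear, this is an illustration of the technique, not AllTalk's actual code, and `tts_model` is just a placeholder:

```python
# Illustration of the low-VRAM pattern (not AllTalk's actual code):
# keep the model parked in system RAM and only move it into VRAM for
# the duration of each generation call. "tts_model" is a placeholder.
import torch

def generate_low_vram(tts_model, *args, **kwargs):
    tts_model.to("cuda")               # load weights into VRAM for inference
    try:
        with torch.inference_mode():
            return tts_model(*args, **kwargs)
    finally:
        tts_model.to("cpu")            # park the weights back in system RAM
        torch.cuda.empty_cache()       # return cached VRAM blocks to the driver
```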

You could be getting a situation where the models (your LLM, Stable Diffusion, and the XTTS model) are getting fragmented between actual VRAM and extended VRAM (which runs on system RAM). It's a long technical explanation, but the nutshell version is that this is the Nvidia driver's doing on Windows. There is no need to change THIS setting listed in this document; it just details the behavior: https://nvidia.custhelp.com/app/answers/detail/a_id/5490/~/system-memory-fallback-for-stable-diffusion#:~:text=The%20switch%20to%20use%20shared,GPUs%2C%20reducing%20the%20application%20speed.
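An easy way to check whether you are spilling past physical VRAM into that extended/shared memory is to compare free vs. total device memory as the driver reports it. Again, just a generic sketch using standard PyTorch calls:

```python
# Generic sketch: free vs. total VRAM as reported by the driver for the
# whole device (all processes). If the combined models push past "total",
# the Windows driver's sysmem fallback starts paging into system RAM.
import torch

free_b, total_b = torch.cuda.mem_get_info()  # (free, total) in bytes
print(f"free: {free_b / 1024**3:.2f} GiB / total: {total_b / 1024**3:.2f} GiB")
```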

Forge, like AllTalk, has options to move things around between VRAM and system RAM as needed.

So I would make sure AllTalk's low VRAM setting is on. I would also check that Forge is set to move models off to RAM as needed.

Thanks

altoiddealer commented 6 months ago

You are certainly too kind - this is some great info you’ve given. Thanks!

altoiddealer commented 6 months ago

I visited my nVidia Control Panel, and the settings overrides I had previously configured for CUDA Sysmem Fallback were indeed gone - apparently, they are automatically removed when the system path is no longer valid, requiring them to be added again.

The issue was resolved by disabling that policy for my 3 Python instances (Forge seems to have 2 instances).

You'd think that whatever was making my inference grind to a crawl would now trigger an OOM error, but by whatever voodoo, everything is now just purring along in peaceful harmony as it was before.

Thanks again - it was a simple fix and I should have had better awareness; your comment did make me think of that as possibly being the culprit.

erew123 commented 6 months ago

Great to hear! Glad you have it sorted!

311-code commented 6 months ago

Thanks a lot for that info altoiddealer.

I was having the exact same problem with SD Forge: basically freezing after the context length of the chat got going a bit, and the chat would also take forever. It only happened when AllTalk was activated.

I did what you said and it worked! I'll go into detail real quick in case anyone else is having this: in the NVIDIA Control Panel, go to Manage 3D Settings, then Program Settings, and click Add.

Add the python.exe from SD Forge's venv\Scripts folder and set "CUDA - Sysmem Fallback Policy" to "Prefer No Sysmem Fallback". Then add all the python.exe's that were in text-generation-webui-main (I had no idea which ones were actually used, so I just added all of them) and set them all to prefer no system fallback.
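If you're not sure where all the python.exe files live, a quick helper like this will list them. Just a convenience sketch; the two root paths are examples only, so point them at your own install folders:

```python
# Convenience sketch for finding every python.exe to add in the NVIDIA
# Control Panel. The two root paths are examples only; point them at
# your own SD Forge and text-generation-webui install folders.
from pathlib import Path

roots = [Path(r"C:\stable-diffusion-webui-forge"),
         Path(r"C:\text-generation-webui-main")]
for root in roots:
    for exe in root.rglob("python.exe"):
        print(exe)
```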

I have a 4090 and have no idea why it's not crashing anymore and runs fast. I'm able to use a higher bpw model now also.

You would think this would cause an OOM error. I don't even have to use the low VRAM setting or DeepSpeed either, so the voice sounds more accurate. Now I no longer have to buy a second PC, haha. Thanks again.

altoiddealer commented 6 months ago

God only knows how many people on the “newer drivers” are losing their minds over the memory handling, unable to pinpoint it to the new nVidia feature that does more harm than good for most users.

Edit: just wanted to mention it's not specific to Forge; the same would happen whenever total VRAM usage gets too high from all sources.