comfyanonymous / ComfyUI

The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
https://www.comfy.org/
GNU General Public License v3.0

Why does AutoencodingEngine need to be reloaded every time it runs? #5190

Open phoenixor opened 2 weeks ago

phoenixor commented 2 weeks ago

Your question

I'm using an Ubuntu server. The hardware is: cuda:0 Tesla T4 : cudaMallocAsync, 15 GB VRAM. I'm running the official sample workflow, which loads the following models:

  1. t5xxl_fp8_e4m3fn.safetensors
  2. flux1-dev-fp8.safetensors
  3. clip_l.safetensors
  4. ae.safetensors

Every time I run the workflow, it reloads the AutoencodingEngine model and takes 150 seconds to output the image. How can I fix this problem?
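For reference, the four files above would typically feed loader nodes roughly like this, in ComfyUI's API "prompt" format. This is a minimal sketch: the node class names and input keys are taken from recent ComfyUI builds and are assumptions here, so they may differ on your version.

```python
# Rough sketch of the loader portion of a Flux workflow in API format.
# Node/input names are assumptions based on recent ComfyUI builds.
loaders = {
    "1": {"class_type": "DualCLIPLoader",          # text encoders
          "inputs": {"clip_name1": "t5xxl_fp8_e4m3fn.safetensors",
                     "clip_name2": "clip_l.safetensors",
                     "type": "flux"}},
    "2": {"class_type": "UNETLoader",              # diffusion model
          "inputs": {"unet_name": "flux1-dev-fp8.safetensors",
                     "weight_dtype": "fp8_e4m3fn"}},
    "3": {"class_type": "VAELoader",               # the VAE (what the log calls AutoencodingEngine)
          "inputs": {"vae_name": "ae.safetensors"}},
}
```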

Logs

No response

Other

No response

ltdrdata commented 1 week ago

There isn't enough space in your VRAM to load both t5 and flux simultaneously. If you modify the prompt, t5 needs to be loaded again to recalculate the conditioning; at that point, the VAE and flux are unloaded to free up VRAM. If you skip the KSampler step and only create conditioning repeatedly, there's no need to reload t5, and if you only perform the steps after the KSampler using the pre-created conditioning, the VAE reloading disappears as well.

The long loading time suggests that your RAM is insufficient, causing swapping to disk. If both t5 and flux stayed resident in RAM, switching between them would be almost instantaneous.
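To make the mechanism above concrete, here is a minimal, illustrative sketch (not ComfyUI's actual model-management code; the sizes are rough assumptions): on a 15 GB T4, an fp8 t5 (~5 GB) and fp8 flux (~12 GB) cannot both stay resident, so whichever model is needed next forces the other off the GPU, and the VAE gets caught in the same shuffle.

```python
from dataclasses import dataclass, field

@dataclass
class Gpu:
    capacity_gb: float
    resident: dict = field(default_factory=dict)   # model name -> size in GB

    def free_gb(self) -> float:
        return self.capacity_gb - sum(self.resident.values())

    def load(self, name: str, size_gb: float) -> list:
        """Return the models that had to be evicted to make room for `name`."""
        if name in self.resident:
            return []                               # already resident: no reload
        evicted = []
        while self.free_gb() < size_gb and self.resident:
            victim = next(iter(self.resident))      # oldest resident model
            del self.resident[victim]               # offloaded back to system RAM
            evicted.append(victim)
        self.resident[name] = size_gb
        return evicted

gpu = Gpu(capacity_gb=15.0)                         # the Tesla T4 from the report
steps = [("t5xxl_fp8", 5.0), ("flux1-dev-fp8", 12.0), ("vae", 0.3),
         ("t5xxl_fp8", 5.0)]                        # a changed prompt re-runs t5
for name, size in steps:
    print(f"load {name:14s} evicted: {gpu.load(name, size)}")
```

With enough system RAM, the evicted weights only move between RAM and VRAM, which is fast; it only becomes a 150-second stall when they have to come back from swap or disk.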

AbstractEyes commented 1 week ago

I'm getting this without the prompt being modified. It reloads each time on the current build using Flux: loads, spins up, finishes, unloads everything, then has to reload slowly yet again every single time I run it again. I'm on the Windows build with a 4090, capping at about 18 of 24 GB of VRAM, and it does it anyway.

ltdrdata commented 1 week ago

> I'm getting this without the prompt being modified. It reloads each time on the current build using Flux: loads, spins up, finishes, unloads everything, then has to reload slowly yet again every single time I run it again. I'm on the Windows build with a 4090, capping at about 18 of 24 GB of VRAM, and it does it anyway.

What is your --reserved-vram setting?
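As a side note, a generic PyTorch check (not a ComfyUI feature) shows how much VRAM is actually free before ComfyUI allocates anything; whatever the desktop, browser, and any reserved amount consume comes out of this number:

```python
import torch

# Free vs. total memory on the current CUDA device, in bytes.
free_b, total_b = torch.cuda.mem_get_info()
print(f"free: {free_b / 2**30:.1f} GiB of {total_b / 2**30:.1f} GiB")
```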

AbstractEyes commented 1 week ago

> > I'm getting this without the prompt being modified. It reloads each time on the current build using Flux: loads, spins up, finishes, unloads everything, then has to reload slowly yet again every single time I run it again. I'm on the Windows build with a 4090, capping at about 18 of 24 GB of VRAM, and it does it anyway.
>
> What is your --reserved-vram setting?

Whatever the default is.

phoenixor commented 1 week ago

> There isn't enough space in your VRAM to load both t5 and flux simultaneously. If you modify the prompt, t5 needs to be loaded again to recalculate the conditioning; at that point, the VAE and flux are unloaded to free up VRAM. If you skip the KSampler step and only create conditioning repeatedly, there's no need to reload t5, and if you only perform the steps after the KSampler using the pre-created conditioning, the VAE reloading disappears as well.
>
> The long loading time suggests that your RAM is insufficient, causing swapping to disk. If both t5 and flux stayed resident in RAM, switching between them would be almost instantaneous.

This is the memory allocation situation on my Ubuntu server running ComfyUI (`free -h` output):

| | total | used | free | shared | buff/cache | available |
|---|---|---|---|---|---|---|
| Mem: | 125Gi | 29Gi | 7.8Gi | 29Mi | 88Gi | 95Gi |
| Swap: | 8.0Gi | 28Mi | 8.0Gi | | | |

What do I need to do to avoid reloading the AutoencodingEngine model?
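One way to test the swapping hypothesis from the earlier comment is to sample RAM and swap usage while the workflow runs. A minimal sketch using `psutil` (a third-party package, assumed to be installed; this is a generic monitor, not part of ComfyUI):

```python
import time
import psutil

# Sample system RAM and swap once per second for a minute while a workflow runs.
for _ in range(60):
    vm, sw = psutil.virtual_memory(), psutil.swap_memory()
    print(f"RAM used {vm.used / 2**30:6.1f} GiB | swap used {sw.used / 2**30:5.1f} GiB")
    time.sleep(1)
```

If swap usage climbs while the AutoencodingEngine is loading, the swapping explanation fits; if RAM and swap stay flat, the 150-second reload is more likely coming from reading the weights off disk each time.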

ltdrdata commented 1 week ago

> I'm getting this without the prompt being modified. It reloads each time on the current build using Flux: loads, spins up, finishes, unloads everything, then has to reload slowly yet again every single time I run it again. I'm on the Windows build with a 4090, capping at about 18 of 24 GB of VRAM, and it does it anyway.

It's flux fp8. Is it correct that this happens simply by changing the seed?

AbstractEyes commented 1 week ago


No. It happens even when I leave the seed the same and have unfinished generations. The VRAM visually drops to nearly nothing, then about half loads almost instantly, so about 10 GB. After that it spins up the next 12 GB until I'm at about 20-22 GB of VRAM total, and when it's done it dumps it all again.

The Flux LoRA I'm training and testing currently does use both UNET and CLIP_L blocks, so that probably matters. I didn't train the T5, but there may be some complexity here I don't understand. My guess is that it has some problem when loading the CLIP + UNET due to T5 load/unload optimizations. Forge doesn't have any problems with it, though.

AbstractEyes commented 1 week ago

So I noticed it isn't happening with all other models; it's primarily this one. When I load the fp8 UNET with the fp16 t5xxl standalone, it doesn't need to reload every time I change seeds.

https://civitai.com/models/637170/flux1-compact-or-clip-and-vae-included

Unless something was patched to make it more stable, in which case I'll hold my tongue.

ltdrdata commented 1 week ago

> So I noticed it isn't happening with all other models; it's primarily this one. When I load the fp8 UNET with the fp16 t5xxl standalone, it doesn't need to reload every time I change seeds.
>
> https://civitai.com/models/637170/flux1-compact-or-clip-and-vae-included
>
> Unless something was patched to make it more stable, in which case I'll hold my tongue.

In the case of fp16, the diffusion model alone already reaches 23.8 GB, so it may be freeing up VRAM to load the VAE. VRAM is also consumed by other apps such as browsers, and ComfyUI additionally tries to reserve a bit of extra VRAM to cover consumption that isn't accurately captured in its measurements.
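A back-of-the-envelope version of that budget; every figure except the 23.8 GB quoted above is a rough assumption:

```python
total     = 24.0   # 4090 VRAM, GB
flux_fp16 = 23.8   # diffusion model alone, as stated above
desktop   = 0.5    # browser / compositor / other apps (assumed)
reserve   = 0.6    # extra VRAM ComfyUI holds back (assumed)
vae       = 0.3    # autoencoder weights (approximate)

headroom = total - flux_fp16 - desktop - reserve
print(f"left for the VAE: {headroom:.1f} GB, needs about {vae} GB")
# Negative headroom means the diffusion model has to be unloaded (at least
# partially) before the VAE can be loaded, which is the reload being observed.
```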

AbstractEyes commented 1 week ago

> > So I noticed it isn't happening with all other models; it's primarily this one. When I load the fp8 UNET with the fp16 t5xxl standalone, it doesn't need to reload every time I change seeds. https://civitai.com/models/637170/flux1-compact-or-clip-and-vae-included Unless something was patched to make it more stable, in which case I'll hold my tongue.
>
> In the case of fp16, the diffusion model alone already reaches 23.8 GB, so it may be freeing up VRAM to load the VAE. VRAM is also consumed by other apps such as browsers, and ComfyUI additionally tries to reserve a bit of extra VRAM to cover consumption that isn't accurately captured in its measurements.

This isn't an fp16 problem. Even when loading both the fp8 UNET and t5xxl_fp8, I end up with the same problem: it unloads both, and it's still doing it. It's gotten to the point where I might as well switch to Forge full time, since it actually works, while this one is patching itself into a laggier and laggier state.

I'm a fervent advocate for Comfy, but if it's just going to not do its job and I'm being told it's a configuration problem after some random patch... C'mon, I have a 4090 and 64 GB of RAM. This shouldn't be happening; this is about as good as Windows hardware gets. The thing is sitting on an M.2 SSD, I have a 12-core CPU, I have more than enough power supplying it, and I'm running Windows 10.

Let's be realistic: why should I continue to advocate for and use this product if it has multiplied the inference time by 8?

comfyanonymous commented 1 week ago

This was fixed like 2 months ago. Try downloading a fresh version of the latest standalone package.

If you still have issues, at least post the logs.