ProKSMT opened this issue 2 weeks ago
update diffusers to 0.30.1
I am using diffusers 0.30.1
Can you try the code in the cogvideox-dev branch with its requirements and run cli_demo.py again? Also, use breakpoint() to locate the line of code that triggers the OOM. Thanks.
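For anyone unsure how to do this, here is a minimal sketch of locating the failing line without stepping through the whole script. The `run_generation` function below is a stand-in that simulates the error; in the real cli_demo.py the failing call would be the `pipe(...)` invocation raising `torch.OutOfMemoryError`:

```python
import sys
import traceback

def run_generation():
    # Stand-in for pipe(prompt=...); simulates the allocation failure.
    raise RuntimeError("CUDA out of memory (simulated)")

failing_frame = None
try:
    run_generation()
except RuntimeError:
    tb = sys.exc_info()[2]
    # The deepest frame in the traceback is the line that raised the error.
    failing_frame = traceback.extract_tb(tb)[-1]
    print(f"error raised in {failing_frame.name} at line {failing_frame.lineno}")
    # For interactive inspection of live tensors, call pdb.post_mortem(tb)
    # here, or place breakpoint() just before the suspected call.
```

This only identifies the raising line; to inspect VRAM state at that point you would use `pdb.post_mortem` interactively.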
I don't know how to use that to locate the OOM code line. Maybe this log will be helpful. I will test the cogvideox-dev branch as soon as possible.
I had the same problem with V100, and it was solved by switching to A10. It seems to be a graphics card problem
I think so too. I found that the V100 does not support BF16. I switched the dtype to FP16 and it worked (main branch), so it might not be necessary to test on the dev branch. However, I don't know exactly how the V100's lack of BF16 support leads to OOM. My guess is that the automatic type conversion multiplies the VRAM consumption.
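As a rough illustration of that guess (the tensor size below is hypothetical, not measured): an automatic upcast from FP16/BF16 to FP32 doubles the footprint of every affected tensor, which is enough to push a model that barely fits over the edge.

```python
# Illustrative arithmetic only: memory needed by one tensor at different dtypes.
def tensor_gib(num_elements, bytes_per_element):
    return num_elements * bytes_per_element / 1024**3

n = 4 * 1024**3          # a hypothetical 4-billion-element activation tensor
fp16 = tensor_gib(n, 2)  # 2 bytes per FP16/BF16 element -> 8 GiB
fp32 = tensor_gib(n, 4)  # 4 bytes per FP32 element      -> 16 GiB
assert fp32 == 2 * fp16  # an automatic upcast doubles the footprint
```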
I'm seeing this on my AMD RX 6900 XT. Changing the dtype does not have any effect, though. Could this have something to do with Flash Attention or Memory efficient attention support? I know that on my GPU neither of those work.
I think we need to look into this issue. The desktop 3060 has only 12 GB, yet it can run the 5B model normally. However, developers report that the V100 32 GB has problems running the 5B model, while the 2B model runs normally. I will check whether it is a precision issue.
OK, I just tested it on the dev branch, and the same issue occurred. It also shows 56.50 GiB.
Check whether the key memory optimizations below are enabled. Do not attempt to enable online quantization; it may cause errors on GPUs of this architecture.
```python
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.float16)
```

or, when passing components explicitly:

```python
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.float16,
)
```
You must use FP16 on a T4; BF16 requires an Ampere or newer architecture. Additionally, do not call .to(device): keeping the model off the GPU lets the offloading below compress usage across CPU and RAM, instead of transferring the entire model to the GPU.
```python
pipe.enable_model_cpu_offload()       # offloads whole sub-models to CPU between steps
pipe.enable_sequential_cpu_offload()  # offloads layer by layer; slower, saves more VRAM
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```
This is already running normally for me on a T4 in Colab. Please check whether this helps you.
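The FP16-vs-BF16 rule of thumb above can be sketched as a tiny check. The (major, minor) compute capabilities are standard NVIDIA values; the helper itself is hypothetical, but `torch.cuda.is_bf16_supported()` performs a similar test at runtime:

```python
# BF16 tensor ops need Ampere (compute capability 8.0) or newer.
BF16_MIN_CC = (8, 0)

def recommended_dtype(compute_capability):
    return "bf16" if compute_capability >= BF16_MIN_CC else "fp16"

assert recommended_dtype((7, 0)) == "fp16"  # V100
assert recommended_dtype((7, 5)) == "fp16"  # T4
assert recommended_dtype((8, 0)) == "bf16"  # A100
```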
So it seems the V100 cannot run in BF16 mode, but FP16 quality looks worse than BF16. Will you release a dedicated FP16 version of the 5B model?
> So it seems the V100 cannot run in BF16 mode, but FP16 quality looks worse than BF16. Will you release a dedicated FP16 version of the 5B model?
We tried, but the results weren’t ideal. The 5B model is currently recommended to run at BF16 precision, which is also the precision we used for training. Converting to FP16 leads to suboptimal performance. However, the 2B model has lower compatibility requirements and can run effectively in FP16.
Using FP16 on a T4 still errors: https://colab.research.google.com/drive/14TTaDTM3_lk69qKb5u4-1_gm_YK6lM3m?usp=sharing
```python
# Create pipeline and run inference
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```
`pipe.enable_sequential_cpu_offload()` does not work.
Why doesn't it work? This should be used with diffusers >= 0.30.1 and the FP16 model, not INT8.
> Why doesn't it work? This should be used with diffusers >= 0.30.1 and the FP16 model, not INT8.

I use diffusers 0.30.2, and it is not INT8; you can see my code at this link: https://colab.research.google.com/drive/14TTaDTM3_lk69qKb5u4-1_gm_YK6lM3m?usp=sharing
This is not right; we have uploaded a Colab-friendly example link in our README.
Where is the Colab link? I can't find it; please post it. Can it run on a T4?
Does this generation take more than an hour?
It shouldn't, but it does take a long time (similar code on my T4 Colab takes about 20 minutes). This is due to the limited compute of this generation of GPUs: to compress memory usage, many time-for-space tradeoffs are used, which makes inference very slow. Additionally, the T4 cannot run BF16 models, and the quality of FP16 inference cannot be guaranteed, so we recommend newer GPUs for inference and fine-tuning.
It takes 1 hour.
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.50 GiB.

V100 32G, 5B model, with the `enable_model_cpu_offload()` option and the `pipe.vae.enable_tiling()` optimization enabled, using diffusers (cli_demo.py).
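The failure mode can be read straight off the error message: the single requested allocation (56.50 GiB) exceeds the V100's entire 32 GiB, so offloading weights cannot help by itself; the allocation must shrink (tiling, slicing, or a different dtype). A small hypothetical helper to extract the figure from such messages:

```python
import re

def parse_oom_gib(msg):
    # Extract the requested allocation size from a torch OOM message
    # like the one reported above. Returns None if no size is found.
    m = re.search(r"Tried to allocate ([\d.]+) GiB", msg)
    return float(m.group(1)) if m else None

msg = "torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.50 GiB."
requested = parse_oom_gib(msg)
assert requested == 56.5
assert requested > 32  # larger than the V100's total VRAM: no offload setting can fit it
```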