THUDM / CogVideo

Text-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

Running CogVideoX-5B on T4/V100 Free Colab Space #204

Open ProKSMT opened 2 weeks ago

ProKSMT commented 2 weeks ago

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.50 GiB.

V100 32G

5B model, with the enable_model_cpu_offload() option and the pipe.vae.enable_tiling() optimization enabled.

Using diffusers (cli_demo.py).
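
For reference, the setup is roughly equivalent to the following minimal sketch (my paraphrase of what cli_demo.py does with those options enabled; the prompt and step count are placeholders, and BF16 is assumed as the default dtype):

import torch
from diffusers import CogVideoXPipeline

# Roughly the configuration that triggers the OOM.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

video = pipe(prompt="a placeholder prompt", num_inference_steps=50).frames[0]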

zRzRzRzRzRzRzR commented 2 weeks ago

update diffusers to 0.30.1

ProKSMT commented 2 weeks ago

I am using diffusers 0.30.1

zRzRzRzRzRzRzR commented 2 weeks ago

Can you try the code in the cogvideox-dev branch with its requirements and run cli_demo.py again? Also, use breakpoint() to locate the line that triggers the OOM. Thanks.

ProKSMT commented 2 weeks ago

I don't know how to use that to locate the OOM line. Maybe this log will be helpful (screenshot of the log attached).

And I will test the cogvideox-dev branch as soon as possible.

GuanleiGao commented 2 weeks ago

I had the same problem with a V100, and it was solved by switching to an A10. It seems to be a GPU-specific problem.

ProKSMT commented 2 weeks ago

I think so too. I found that the V100 does not support BF16. I switched the dtype to FP16 and it worked (main branch), so it is probably not necessary to test the dev branch. However, I don't know exactly how the V100 leads to OOM just because it doesn't support BF16; my guess is that the automatic type conversion multiplies the VRAM consumption.
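
A minimal sketch for picking the dtype at runtime (assuming a CUDA build of PyTorch; torch.cuda.is_bf16_supported() is available in recent releases):

import torch

# Volta GPUs such as the V100 (compute capability 7.0) have no native BF16;
# Ampere and newer (compute capability >= 8.0) do.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
print(torch.cuda.get_device_capability(), dtype)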

Exploder98 commented 2 weeks ago

I'm seeing this on my AMD RX 6900 XT. Changing the dtype does not have any effect, though. Could this have something to do with Flash Attention or Memory efficient attention support? I know that on my GPU neither of those work.
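
A quick sketch for checking which scaled-dot-product-attention backends the local PyTorch build enables (assuming PyTorch >= 2.0; on ROCm builds the torch.backends.cuda flags are still the ones to query):

import torch

# Flash and memory-efficient attention are not available on every GPU/ROCm combination;
# when both are off, attention falls back to the slower, more memory-hungry math path.
print("flash SDP:          ", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient SDP:  ", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math (fallback) SDP:", torch.backends.cuda.math_sdp_enabled())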

zRzRzRzRzRzRzR commented 2 weeks ago

We need to look into this issue. The desktop 3060 has only 12 GB yet runs the 5B model normally, but developers report that the V100 32 GB has problems running the 5B model while the 2B model runs fine. I will check whether it is a precision issue.

ProKSMT commented 2 weeks ago

OK, I just tested the dev branch and the same issue occurred. It also tries to allocate 56.50 GiB (screenshot attached).

zRzRzRzRzRzRzR commented 2 weeks ago

Check whether these key settings are in place:

  1. Do not enable online quantization; it may cause errors on GPUs of this architecture.

  2. Check how the pipeline is constructed:

    pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.float16)
    # or, when passing components explicitly:
    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-5b",
        text_encoder=text_encoder,
        transformer=transformer,
        vae=vae,
        torch_dtype=torch.float16,
    )

You must use FP16 on the T4; BF16 requires a GPU with Ampere or newer architecture. Additionally, do not call .to(device): with offloading enabled, parts of the model stay in CPU memory and are moved to the GPU only when needed, instead of transferring the complete model to the GPU at once.

  3. Finally, check whether these four memory-saving options are enabled:
    pipe.enable_model_cpu_offload()
    pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    The model runs normally for me on a T4 in Colab (screenshot attached).

Please check whether this helps you.
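
Putting the checklist above together, a minimal T4/V100-oriented sketch (the prompt and generation parameters are placeholders, not fixed recommendations):

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load in FP16; pre-Ampere GPUs such as the T4 and V100 have no native BF16.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.float16)

# Memory-saving options. Do NOT call pipe.to("cuda") when offloading is enabled.
pipe.enable_model_cpu_offload()  # or pipe.enable_sequential_cpu_offload() for even lower VRAM
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

video = pipe(
    prompt="a placeholder prompt",
    num_inference_steps=50,
    guidance_scale=6.0,
    num_frames=49,
).frames[0]
export_to_video(video, "output.mp4", fps=8)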

ProKSMT commented 2 weeks ago

So it seems that the V100 cannot run in BF16 mode. But it looks like FP16 output is not as good as BF16 (comparison screenshot attached).

Will you release a dedicated FP16 version of the 5B model?

zRzRzRzRzRzRzR commented 2 weeks ago

> So it seems that the V100 cannot run in BF16 mode. But it looks like FP16 output is not as good as BF16.
>
> Will you release a dedicated FP16 version of the 5B model?

We tried, but the results weren’t ideal. The 5B model is currently recommended to run at BF16 precision, which is also the precision we used for training. Converting to FP16 leads to suboptimal performance. However, the 2B model has lower compatibility requirements and can run effectively in FP16.

camenduru commented 2 weeks ago

free colab: https://github.com/camenduru/CogVideoX-5B-jupyter

lonngxiang commented 1 week ago

Using FP16 on a T4 still errors: https://colab.research.google.com/drive/14TTaDTM3_lk69qKb5u4-1_gm_YK6lM3m?usp=sharing

# Create pipeline and run inference
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

(screenshot of the error attached)

lonngxiang commented 1 week ago

pipe.enable_sequential_cpu_offload() does not work either.

zRzRzRzRzRzRzR commented 1 week ago

Why does it not work? It should be used with diffusers >= 0.30.1 and the FP16 model, not INT8.
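
For clarity, a minimal sketch of the intended usage (assumptions: diffusers >= 0.30.1, the FP16 weights, no INT8 quantization):

import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.float16)

# Sequential offload streams submodules to the GPU one at a time; do not combine it with
# enable_model_cpu_offload() or a manual pipe.to("cuda").
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()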

lonngxiang commented 1 week ago

> Why does it not work? It should be used with diffusers >= 0.30.1 and the FP16 model, not INT8.

I use diffusers 0.30.2 and no INT8; you can see my code at this link: https://colab.research.google.com/drive/14TTaDTM3_lk69qKb5u4-1_gm_YK6lM3m?usp=sharing

zRzRzRzRzRzRzR commented 1 week ago
(screenshot attached)

This is not right, and we have added a Colab-friendly example link in our README.

lonngxiang commented 1 week ago

Where is the Colab link? I can't see it; please send it. Can a T4 run this?

zRzRzRzRzRzRzR commented 1 week ago

https://github.com/camenduru/CogVideoX-5B-jupyter

lonngxiang commented 1 week ago

Does generation take more than an hour? (screenshot attached)

zRzRzRzRzRzRzR commented 1 week ago

It should not need that long, but it does take a long time: with similar code on my T4 Colab it takes about 20 minutes. This is due to the limited compute of this generation of GPUs, and because the memory-saving options trade time for space, inference is very slow. Additionally, the T4 cannot run BF16 models and the quality of FP16 inference cannot be guaranteed, so we recommend using newer GPUs for inference and fine-tuning.

lonngxiang commented 1 week ago

It takes one hour (screenshot attached).