THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0
7.86k stars 732 forks

"index_select_cuda" not implemented for 'Float8_e4m3fn' error from CogVideoXImageToVideoPipeline happens when FP8 used #366

Open FurkanGozukara opened 1 week ago

FurkanGozukara commented 1 week ago

I have opened a detailed issue here. Does anyone have any ideas? This happens when FP8 is used:

https://github.com/huggingface/diffusers/issues/9539

FurkanGozukara commented 1 week ago

This doesn't make sense to me, because we are able to run the FLUX model and T5 XXL in FP8 mode when using FLUX with ComfyUI.

FurkanGozukara commented 1 week ago

Please add support to your pipeline for running the transformer and T5 XXL in FP8; it is entirely doable.

@zRzRzRzRzRzRzR @wenyihong @chenxwh

That way, pipeline_cogvideox_image2video can run fast on 24 GB GPUs on Windows.

Currently, CPU offloading is mandatory, which is overkill. Thank you.

pipeline_cogvideox_image2video.txt

zRzRzRzRzRzRzR commented 1 week ago

This appears to be an issue with T5. Additionally, if your GPU is sufficient, you can remove all configurations of cpu_offload and use pipe.to("cuda").

The current FP8 is implemented through torchao, with fp8 weights for BF16 inference. If you want to use E4M3 inference, some adjustments will likely be needed. During my testing I encountered an error similar to yours. I have contacted diffusers about this error, and I suspect there are some incompatibilities among the underlying libraries: torchao, torch, and diffusers. We will attempt to address this in the future, but this work may need to be completed by the community, as we currently do not have enough manpower.

FurkanGozukara commented 1 week ago

> This appears to be an issue with T5. […]

Thank you so much for the reply.

I have 24 GB, and without optimizations it uses 26 GB (I tested on a cloud GPU).

I opened an issue on Diffusers as well. We are able to use T5 and FLUX in FP8, so I think CogVideo should be able to do the same.