Closed: ApolloRay closed this issue 3 months ago.
While loading checkpoint shards (2/2), VRAM usage exceeds 22 GB.
I ran into the same problem: this model cannot run on a 24 GB GPU. I think the model needs some compression; the T5 text embedding consumes too much memory.
```python
from transformers import T5EncoderModel

# Load the T5 encoder on CPU first, cast it to fp16, then move it to
# the GPU and save the half-precision weights for later reuse.
text_encoder = T5EncoderModel.from_pretrained(path, subfolder="text_encoder").to("cpu")
text_encoder = text_encoder.half()
text_encoder = text_encoder.eval().to("cuda")
text_encoder.save_pretrained(save_path)
```

I think you can try this code. VRAM usage for T5 will drop to about 10 GB.
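If it helps, here is a minimal usage sketch for the saved fp16 encoder. The paths and the prompt are placeholders, and the tokenizer is loaded from the original checkpoint, since `save_pretrained` above only writes the encoder weights:

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

path = "path/to/original/checkpoint"  # placeholder
save_path = "path/to/t5-fp16"         # placeholder: where the fp16 weights were saved

# The tokenizer was not saved above, so load it from the original repo.
tokenizer = T5Tokenizer.from_pretrained(path, subfolder="tokenizer")
text_encoder = T5EncoderModel.from_pretrained(
    save_path, torch_dtype=torch.float16
).to("cuda").eval()

inputs = tokenizer("a prompt to encode", return_tensors="pt").to("cuda")
with torch.no_grad():
    prompt_embeds = text_encoder(**inputs).last_hidden_state  # fp16 embeddings
```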
But I'm not sure whether it will affect speed or output quality, especially speed. Regarding the speed question, see https://github.com/huggingface/transformers/issues/11792
Sorry, it doesn't work.
Here is an example for 8-bit T5. You can also try running T5 inference on the CPU. Later we will support diffusers to make the whole pipeline easier to run.
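In case the example is hard to find, a minimal sketch of 8-bit loading via bitsandbytes (the checkpoint path is a placeholder; requires `pip install bitsandbytes accelerate`):

```python
from transformers import T5Tokenizer, T5EncoderModel

path = "path/to/checkpoint"  # placeholder: the pipeline checkpoint directory

tokenizer = T5Tokenizer.from_pretrained(path, subfolder="tokenizer")
# load_in_8bit quantizes the linear layers to int8 with bitsandbytes,
# roughly halving memory relative to fp16.
text_encoder = T5EncoderModel.from_pretrained(
    path,
    subfolder="text_encoder",
    load_in_8bit=True,
    device_map="auto",  # let accelerate decide where the weights go
)
```

Alternatively, keep the encoder in full precision on the CPU and move only the resulting embeddings to the GPU.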
Thanks ~
Using this method, it is indeed possible to reduce the GPU memory usage, but it results in a situation where the int8 model is on the CPU while the other parts remain on the GPU. This still requires some handling:

```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
```
For example, I changed this part of the script:

```python
tokenizer = T5Tokenizer.from_pretrained(args.pipeline_load_from, load_in_8bit=True, subfolder="tokenizer")
text_encoder = T5EncoderModel.from_pretrained(
    args.pipeline_load_from, load_in_8bit=True, device_map="auto", subfolder="text_encoder"
)
```

Then, running inference from the gradio UI raises the same RuntimeError as above.
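For reference, a hedged diagnostic sketch (assuming the `tokenizer` and `text_encoder` from the snippet above): with `device_map="auto"`, accelerate may place some encoder modules on the CPU, so tensors handed to the GPU-side pipeline can end up on the wrong device. Printing the device map and aligning devices explicitly is one way to narrow it down:

```python
import torch

# Where did accelerate put each module? (hf_device_map is set on models
# loaded with a device_map; a 'cpu' entry means offloading happened.)
print(getattr(text_encoder, "hf_device_map", None))

# Feed the prompt on the same device as the encoder's first parameters,
# then move the resulting embeddings to the diffusion model's device.
device = next(text_encoder.parameters()).device
inputs = tokenizer("a prompt", return_tensors="pt").to(device)
with torch.no_grad():
    prompt_embeds = text_encoder(**inputs).last_hidden_state
prompt_embeds = prompt_embeds.to("cuda:0")  # match the rest of the pipeline
```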
It works OK for me.
Replace it with

```python
text_encoder = T5EncoderModel.from_pretrained(
    args.pipeline_load_from, load_in_8bit=True, subfolder="text_encoder"
)
```

and try. Without `device_map="auto"`, accelerate should not offload encoder layers to the CPU, so the whole int8 model stays on the GPU. @liangwq
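If useful, a quick (hedged) check that the change took effect:

```python
# All parameters should report a single CUDA device; any 'cpu' entry
# means layers are still being offloaded.
print({str(p.device) for p in text_encoder.parameters()})  # expect {'cuda:0'}
```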
On a 4090 it still doesn't work.
How can I run inference with this model?