Closed: ApolloRay closed this issue 3 months ago.
While loading checkpoint shards (2/2), VRAM usage exceeds 22 GB.
I ran into the same problem: this model cannot run on a 24 GB GPU. I think the model needs some compression; the T5 text embedding consumes too much memory.
```python
from transformers import T5EncoderModel

# Load the T5 encoder on CPU first, cast it to fp16, then move it to
# the GPU and save the half-precision weights for later reuse.
text_encoder = T5EncoderModel.from_pretrained(path, subfolder="text_encoder").to("cpu")
text_encoder = text_encoder.half()
text_encoder = text_encoder.eval().to("cuda")
text_encoder.save_pretrained(save_path)
```

I think you can try this code. VRAM usage for T5 will drop to about 10 GB.
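If it helps, here is a minimal usage sketch for the saved fp16 encoder. The paths and the prompt are placeholders, and the tokenizer is loaded from the original checkpoint, since `save_pretrained` above only writes the encoder weights:

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

path = "path/to/original/checkpoint"  # placeholder
save_path = "path/to/t5-fp16"         # placeholder: where the fp16 weights were saved

# The tokenizer was not saved above, so load it from the original repo.
tokenizer = T5Tokenizer.from_pretrained(path, subfolder="tokenizer")
text_encoder = T5EncoderModel.from_pretrained(
    save_path, torch_dtype=torch.float16
).to("cuda").eval()

inputs = tokenizer("a prompt to encode", return_tensors="pt").to("cuda")
with torch.no_grad():
    prompt_embeds = text_encoder(**inputs).last_hidden_state  # fp16 embeddings
```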
But I'm not sure whether it will affect speed or output quality, especially speed. Regarding the speed question, see https://github.com/huggingface/transformers/issues/11792
Sorry, it doesn't work.
Here is an example for 8-bit T5. You can also try running T5 inference on the CPU. Later we will support diffusers to make the whole pipeline easier to run.
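In case the example is hard to find, a minimal sketch of 8-bit loading via bitsandbytes (the checkpoint path is a placeholder; requires `pip install bitsandbytes accelerate`):

```python
from transformers import T5Tokenizer, T5EncoderModel

path = "path/to/checkpoint"  # placeholder: the pipeline checkpoint directory

tokenizer = T5Tokenizer.from_pretrained(path, subfolder="tokenizer")
# load_in_8bit quantizes the linear layers to int8 with bitsandbytes,
# roughly halving memory relative to fp16.
text_encoder = T5EncoderModel.from_pretrained(
    path,
    subfolder="text_encoder",
    load_in_8bit=True,
    device_map="auto",  # let accelerate decide where the weights go
)
```

Alternatively, keep the encoder in full precision on the CPU and move only the resulting embeddings to the GPU.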
Thanks ~
Using this method, it is indeed possible to reduce the GPU memory usage, but it results in a situation where the int8 model is on the CPU while the other parts remain on the GPU. This still requires some handling:

```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
```
For example, I changed this part of the script:

```python
tokenizer = T5Tokenizer.from_pretrained(args.pipeline_load_from, load_in_8bit=True, subfolder="tokenizer")
text_encoder = T5EncoderModel.from_pretrained(
    args.pipeline_load_from, load_in_8bit=True, device_map="auto", subfolder="text_encoder"
)
```

Then, running inference from the gradio UI raises the same RuntimeError as above.
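For reference, a hedged diagnostic sketch (assuming the `tokenizer` and `text_encoder` from the snippet above): with `device_map="auto"`, accelerate may place some encoder modules on the CPU, so tensors handed to the GPU-side pipeline can end up on the wrong device. Printing the device map and aligning devices explicitly is one way to narrow it down:

```python
import torch

# Where did accelerate put each module? (hf_device_map is set on models
# loaded with a device_map; a 'cpu' entry means offloading happened.)
print(getattr(text_encoder, "hf_device_map", None))

# Feed the prompt on the same device as the encoder's first parameters,
# then move the resulting embeddings to the diffusion model's device.
device = next(text_encoder.parameters()).device
inputs = tokenizer("a prompt", return_tensors="pt").to(device)
with torch.no_grad():
    prompt_embeds = text_encoder(**inputs).last_hidden_state
prompt_embeds = prompt_embeds.to("cuda:0")  # match the rest of the pipeline
```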
It works OK for me.
Replace it with

```python
text_encoder = T5EncoderModel.from_pretrained(
    args.pipeline_load_from, load_in_8bit=True, subfolder="text_encoder"
)
```

and try. Without `device_map="auto"`, accelerate should not offload encoder layers to the CPU, so the whole int8 model stays on the GPU. @liangwq
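If useful, a quick (hedged) check that the change took effect:

```python
# All parameters should report a single CUDA device; any 'cpu' entry
# means layers are still being offloaded.
print({str(p.device) for p in text_encoder.parameters()})  # expect {'cuda:0'}
```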
On a 4090 it still doesn't work.
How can I run inference with this model?