If you use the default way of initializing and loading weights, the model is first initialized as an fp32 copy and then cast to bf16. The fp32 copy needs roughly 140 GB of RAM or GPU memory; the bf16 copy needs about 71 GB.
If you do not want to first materialize an fp32 copy that actually occupies memory, you can try combining accelerate's init_empty_weights and load_checkpoint_and_dispatch to initialize and load the model. This reduces memory usage; with this approach a bit over 70 GB of RAM/GPU memory is enough.
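A minimal sketch of that loading pattern (the local paths and the per-GPU memory limits below are placeholders to adapt to your setup):
import torch
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
from transformers import AutoModelForCausalLM

# Build the model on the meta device, so no real fp32 copy is ever allocated.
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "./Emu2-Chat",              # local checkpoint directory ("BAAI/Emu2-Chat" on the hub)
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True)

# Split the modules across the available devices.
device_map = infer_auto_device_map(
    model,
    max_memory={0: '38GiB', 1: '38GiB'},                     # placeholder split
    no_split_module_classes=['Block', 'LlamaDecoderLayer'])

# Stream the checkpoint shards directly onto the target devices as bf16.
model = load_checkpoint_and_dispatch(
    model,
    checkpoint="./Emu2-Chat",
    device_map=device_map,
    dtype=torch.bfloat16).eval()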
Understood, thank you very much.
One more question: when I run the HF version of the Emu2-Chat model with the code from this repo on a single machine with four 24 GB GPUs, using offload, and then load an image and convert it to bfloat16, I get the following error:
WARNING:root:Some parameters are on the meta device device because they were offloaded to the disk.
time cost of loading model: 172.20009589195251 seconds
Traceback (most recent call last):
File "/xxx/Project/Emu/Emu2/test.py", line 57, in <module>
outputs = model.generate(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/weights/modeling_emu.py", line 159, in generate
prompt_image_embeds = self.project_up(prompt_image_embeds)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != float
This means some model parameters are not bfloat16, so I converted the image to float instead, and then hit bfloat16 parameters elsewhere:
Traceback (most recent call last):
File "/xxx/Project/Emu/Emu2/test.py", line 57, in <module>
outputs = model.generate(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/weights/modeling_emu.py", line 156, in generate
prompt_image_embeds = self.model.encode_image(image, n_query=self.n_query)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/weights/modeling_emu.py", line 92, in encode_image
image_embeds = self.visual(image)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/weights/visual.py", line 448, in forward
features = self.forward_features(x) # [B, n_patch, C]
File "/root/.cache/huggingface/modules/transformers_modules/weights/visual.py", line 408, in forward_features
x = self.patch_embed(x)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/weights/visual.py", line 339, in forward
x = self.proj(x).flatten(2).transpose(1, 2)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same
The code I used:
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
import time
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:70"
p_time = time.time()
tokenizer = AutoTokenizer.from_pretrained("/xxx/Project/Emu/Emu2/weights") # "BAAI/Emu2-Chat"
c_time = time.time()
print(f"time cost of loading tokenizer: {c_time-p_time} seconds")
p_time = time.time()
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "/xxx/Project/Emu/Emu2/weights",  # "BAAI/Emu2-Chat"
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True)
c_time = time.time()
print(f"time cost of loading model (no gpu): {c_time-p_time} seconds")
device_map = infer_auto_device_map(model, max_memory={0:'16GiB',1:'18GiB',2:'18GiB',3:'18GiB'}, no_split_module_classes=['Block','LlamaDecoderLayer'])
# input and output logits should be on same device
device_map["model.decoder.lm.lm_head"] = 0
p_time = time.time()
model = load_checkpoint_and_dispatch(
    model,
    checkpoint="/xxx/Project/Emu/Emu2/weights",
    device_map=device_map,
    offload_folder="/xxx/Project/Emu/Emu2/offload_folder").eval()
c_time = time.time()
print(f"time cost of loading model: {c_time-p_time} seconds")
# `[<IMG_PLH>]` is the image placeholder which will be replaced by image embeddings.
# the number of `[<IMG_PLH>]` should be equal to the number of input images
query = "[<IMG_PLH>][descripe this picture]"
images = [
    Image.open("./examples/red_white_3_bottom_left.jpg").convert('RGB'),
]
inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=images
)
p_time = time.time()
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
        length_penalty=-1)
c_time = time.time()
print(f"time cost of generating answer: {c_time-p_time} seconds")
output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(output_text)
How should I handle this? Thanks again for your help.
In this case, the parameters offloaded to disk are probably stored as float32 rather than bfloat16, which then causes the dtype error at inference time. Could you try passing dtype=torch.bfloat16 when calling load_checkpoint_and_dispatch, so that all model weights are cast to bfloat16?
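That is, one extra argument in the load_checkpoint_and_dispatch call from the code above (a sketch reusing the same placeholder paths):
model = load_checkpoint_and_dispatch(
    model,
    checkpoint="/xxx/Project/Emu/Emu2/weights",
    device_map=device_map,
    dtype=torch.bfloat16,   # cast the loaded weights to bf16, the fix suggested above
    offload_folder="/xxx/Project/Emu/Emu2/offload_folder").eval()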
I tried setting dtype=torch.bfloat16 and ran into a new problem:
Traceback (most recent call last):
File "/xxx/Project/Emu/Emu2/test.py", line 57, in <module>
with torch.no_grad():
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/weights/modeling_emu.py", line 156, in generate
prompt_image_embeds = self.model.encode_image(image, n_query=self.n_query)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/weights/modeling_emu.py", line 92, in encode_image
image_embeds = self.visual(image)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/weights/visual.py", line 448, in forward
features = self.forward_features(x) # [B, n_patch, C]
File "/root/.cache/huggingface/modules/transformers_modules/weights/visual.py", line 408, in forward_features
x = self.patch_embed(x)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/weights/visual.py", line 339, in forward
x = self.proj(x).flatten(2).transpose(1, 2)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = module._old_forward(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (c10::Half) and bias type (c10::BFloat16) should be the same
Why does a Half type show up here?
It seems to come from this visual module.
Following the error message, I changed the model and the image to torch.float16, and the model ran successfully. I still have a few questions:
- Does the demo on HF use torch.float32 model weights?
- Does this code run on your machine (with dtype torch.bfloat16 while also using offload)?
Thanks again for your patient answers.
1. All of our training, evaluation, and the demo use bf16 precision. 2. In the HF model zoo, Emu2 and Emu2-Chat were uploaded in float32 precision, while Emu2-Gen is in bfloat16. 3. All native PyTorch versions of the models are in bfloat16 precision.
- I ran the code you posted above, and inference works fine on my side.
- The place where your error occurs is essentially the very beginning of inference: the first line of the forward pass that extracts image features. The error means the image passed in has dtype half while the model weights are bf16. Please check the data type of the image you pass in and whether it gets changed somewhere.
Attached is the device map allocated on my side (my development environment only has 2 GPUs, so I merged the memory allocation from your code):
Strange. I changed the image data type in the same way as the example code, passing image=inputs["image"].to(torch.bfloat16) in generate, yet the Half error still appears; everything else is the same.
Also, the weights offloaded to disk on my side are float32. Is that normal?
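For reference, one way to inspect what is actually stored on disk; the index.json layout is an assumption about accelerate's disk-offload format, and the weights directory path is the placeholder used above:
import glob
import json
import torch

# Dtypes recorded for parameters that accelerate offloaded to disk
# (assumes accelerate writes an index.json into the offload folder).
with open("/xxx/Project/Emu/Emu2/offload_folder/index.json") as f:
    offload_index = json.load(f)
print({name: meta.get("dtype") for name, meta in list(offload_index.items())[:5]})

# Dtypes inside the first original checkpoint shard found in the weights directory.
shard_path = sorted(glob.glob("/xxx/Project/Emu/Emu2/weights/pytorch_model*.bin"))[0]
shard = torch.load(shard_path, map_location="cpu")
print({k: v.dtype for k, v in list(shard.items())[:5]})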
On my side, after passing dtype=torch.bfloat16 to load_checkpoint_and_dispatch, iterating over for n, p in model.named_parameters(): shows that every p.dtype is torch.bfloat16.
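A small sketch of that check, plus a check of the image dtype right before generate (variable names follow the code earlier in the thread):
# All parameters should report torch.bfloat16 after loading with dtype=torch.bfloat16.
non_bf16 = {n: p.dtype for n, p in model.named_parameters() if p.dtype != torch.bfloat16}
print("non-bf16 parameters:", non_bf16)  # expected: {}

# The image tensor handed to generate should also be bfloat16.
image = inputs["image"].to(torch.bfloat16)
print("image dtype:", image.dtype)       # expected: torch.bfloat16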
Would you mind posting your code?
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./Emu2-Chat/")
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "./Emu2-Chat",  # "BAAI/Emu2-Chat"
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True)
device_map = infer_auto_device_map(model, max_memory={0:'34GiB',1:'36GiB'}, no_split_module_classes=['Block','LlamaDecoderLayer'])
device_map["model.decoder.lm.lm_head"] = 0
model = load_checkpoint_and_dispatch(
    model,
    checkpoint="./Emu2-Chat/",
    device_map=device_map,
    dtype=torch.bfloat16,
    offload_folder="./tmp/offload_folder").eval()
query = "[<IMG_PLH>]describe this picture"
images = [
    Image.open("./examples/red_white_3_bottom_left.jpg").convert('RGB'),
]
inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=images
)
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
        length_penalty=-1,
    )
output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
# ['A red circle with the number three in the middle.']
@SANJINGSHOU14 If you still have problems, first check that the versions of all packages match those in requirements.txt.
Thank you very much. I checked the versions and found that my transformers version was different; after switching to 4.30.1 the problem was solved and the model now runs fine with torch.bfloat16. There is one warning, though:
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
I searched around and found that most people just ignore this warning, but I am not sure in what situations it actually matters. Do I need to set legacy to False?
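If you do want to opt out of the legacy behaviour, it can be set when loading the tokenizer; a minimal sketch (whether this changes Emu2's outputs has not been verified in this thread):
from transformers import AutoTokenizer

# Opt in to the newer tokenization behaviour described in transformers PR #24565.
tokenizer = AutoTokenizer.from_pretrained("./Emu2-Chat/", legacy=False)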
In our training and evaluation pipelines this parameter was never set explicitly; we used the default value. So we have not looked in depth into how it affects model performance.
Understood, thank you very much for your answers.
@SANJINGSHOU14 What is this, some tool for checking GPU memory usage?
@xmy0916 It's just a comment in the authors' code in the Emu2 repo; presumably the authors computed those numbers themselves. You can also work them out yourself.
Hello, when I run PyTorch inference with the weights from
https://model.baai.ac.cn/model-detail/220122/Emu2-Chat_pytorch_model.bf16.pth
I find that the per-layer size does not match what I calculate. According to the comment in the code, the last row says each layer has 535049216 parameters of type bf16, i.e. 2 bytes each, which works out to about 0.997 GB of GPU memory per layer. But when loading onto the GPU and watching memory usage in real time, each layer takes a little over 2 GB, so a 22.2 GB GPU can only hold 10 layers. Does this mean the checkpoint actually stores 4-byte data when loaded? If so, a machine that cannot fit float32 weights also cannot run bf16, is that right?
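For reference, the arithmetic behind the two numbers (the parameter count is taken from the comment quoted above; whether the extra usage really comes from 4-byte storage would still need to be confirmed by checking the dtypes of the loaded tensors):
params_per_layer = 535_049_216

print(params_per_layer * 2 / 1024**3)  # ~0.997 GiB per layer if stored as bf16 (2 bytes/value)
print(params_per_layer * 4 / 1024**3)  # ~1.99 GiB per layer if stored as fp32 (4 bytes/value)

# Quick way to check what actually ended up on the GPU, e.g. for the first layer:
# print({n: p.dtype for n, p in model.named_parameters() if ".layers.0." in n})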