baaivision / Emu

Emu Series: Generative Multimodal Models from BAAI
https://baaivision.github.io/emu2/
Apache License 2.0

GPU memory usage does not match the calculated size #71

Closed chaochen1998 closed 10 months ago

chaochen1998 commented 11 months ago

Hi, when running PyTorch inference with the weights from https://model.baai.ac.cn/model-detail/220122/Emu2-Chat_pytorch_model.bf16.pth, I found that the size of each layer does not match what I calculated:

    emu_model.visual:                           4B
    emu_model.decoder.lm.project_down:        omit
    emu_model.decoder.lm.project_up:          omit
    emu_model.decoder.lm.model.embed_tokens:  omit
    emu_model.decoder.lm.model.norm:          omit
    emu_model.decoder.lm.lm_head:             omit
    emu_model.decoder.lm.model.layers.[0..59]: 33B (0.55B/layer)

This is the comment from the code. According to the last line, each layer has 535,049,216 parameters. With bf16 (2 bytes per parameter), each layer should take about 0.997 GiB of GPU memory. But when I load the model onto the GPU and watch memory usage in real time, each layer takes a little over 2 GB, so a 22.2 GB GPU can only hold 10 layers. Does this mean the data stored in the ckpt is loaded as 4 bytes per value? If so, a machine that cannot fit the float32 weights also cannot run bf16, is that right?
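
For reference, the quick arithmetic behind my numbers (using the 535,049,216 per-layer figure from the comment above):

params_per_layer = 535_049_216          # from the code comment above

print(params_per_layer * 2 / 1024**3)   # bf16, 2 bytes/param -> ~0.997 GiB per layer
print(params_per_layer * 4 / 1024**3)   # fp32, 4 bytes/param -> ~1.99 GiB, matches what I see on the GPU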

ryanzhangfan commented 11 months ago

With the default initialization and weight-loading path, the model is first instantiated as an fp32 version and then converted to bf16. The fp32 version needs roughly 140 GB of RAM or GPU memory; the bf16 version needs about 71 GB.

If you don't want to first materialize an fp32 copy that actually occupies RAM/GPU memory, you can combine accelerate's init_empty_weights and load_checkpoint_and_dispatch to initialize and load the model. This reduces memory/GPU-memory usage; that way you need a bit over 70 GB of RAM/GPU memory in total.

chaochen1998 commented 11 months ago

Got it, thank you very much.

One more question. When I run the HF version of the Emu2-Chat model with the code from this repo on a single machine with four 24 GB GPUs, using offload, and pass in an image converted to bfloat16, I get the following error:

WARNING:root:Some parameters are on the meta device device because they were offloaded to the disk.
time cost of loading model: 172.20009589195251 seconds
Traceback (most recent call last):
  File "/xxx/Project/Emu/Emu2/test.py", line 57, in <module>
    outputs = model.generate(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/weights/modeling_emu.py", line 159, in generate
    prompt_image_embeds = self.project_up(prompt_image_embeds)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != float

This means some model parameters are not bfloat16. So I converted the image to float instead, and then hit the opposite case elsewhere, where the parameters are bfloat16:

Traceback (most recent call last):
  File "/xxx/Project/Emu/Emu2/test.py", line 57, in <module>
    outputs = model.generate(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/weights/modeling_emu.py", line 156, in generate
    prompt_image_embeds = self.model.encode_image(image, n_query=self.n_query)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/weights/modeling_emu.py", line 92, in encode_image
    image_embeds = self.visual(image)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/weights/visual.py", line 448, in forward
    features = self.forward_features(x)  # [B, n_patch, C]
  File "/root/.cache/huggingface/modules/transformers_modules/weights/visual.py", line 408, in forward_features
    x = self.patch_embed(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/weights/visual.py", line 339, in forward
    x = self.proj(x).flatten(2).transpose(1, 2)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same

The code I'm using:

from PIL import Image 
import requests
import torch 
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
import time
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:70"
p_time = time.time()
tokenizer = AutoTokenizer.from_pretrained("/xxx/Project/Emu/Emu2/weights") # "BAAI/Emu2-Chat"
c_time = time.time()
print(f"time cost of loading tokenizer: {c_time-p_time} seconds")

p_time = time.time()
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "/xxx/Project/Emu/Emu2/weights",  # "BAAI/Emu2-Chat"
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True)
c_time = time.time()
print(f"time cost of loading model (no gpu): {c_time-p_time} seconds")

device_map = infer_auto_device_map(model, max_memory={0:'16GiB',1:'18GiB',2:'18GiB',3:'18GiB'}, no_split_module_classes=['Block','LlamaDecoderLayer'])

# input and output logits should be on same device
device_map["model.decoder.lm.lm_head"] = 0

p_time = time.time()
model = load_checkpoint_and_dispatch(
    model,
    checkpoint="/xxx/Project/Emu/Emu2/weights",
    device_map=device_map,
    offload_folder="/xxx/Project/Emu/Emu2/offload_folder").eval()
c_time = time.time()
print(f"time cost of loading model: {c_time-p_time} seconds")

# `[<IMG_PLH>]` is the image placeholder which will be replaced by image embeddings. 
# the number of `[<IMG_PLH>]` should be equal to the number of input images
query = "[<IMG_PLH>][descripe this picture]"

images = [
    Image.open("./examples/red_white_3_bottom_left.jpg").convert('RGB'),
]

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=images
)

p_time = time.time()
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
        length_penalty=-1)
c_time = time.time()
print(f"time cost of generating answer: {c_time-p_time} seconds")
output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(output_text)

How should I handle this? Thanks again for your help.

ryanzhangfan commented 11 months ago

This is probably because the parameters saved to disk during offload are float32 rather than bfloat16, which then causes the error at inference time. Could you try passing dtype=torch.bfloat16 when calling load_checkpoint_and_dispatch, so that all model weights are forced to bfloat16?
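
Based on the snippet you posted, that would just mean adding one extra argument, something like:

model = load_checkpoint_and_dispatch(
    model,
    checkpoint="/xxx/Project/Emu/Emu2/weights",
    device_map=device_map,
    dtype=torch.bfloat16,   # cast all loaded (and offloaded) weights to bfloat16
    offload_folder="/xxx/Project/Emu/Emu2/offload_folder").eval()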

chaochen1998 commented 11 months ago

This is probably because the parameters saved to disk during offload are float32 rather than bfloat16, which then causes the error at inference time. Could you try passing dtype=torch.bfloat16 when calling load_checkpoint_and_dispatch, so that all model weights are forced to bfloat16?

I tried setting dtype=torch.bfloat16, and a new problem appeared:

Traceback (most recent call last):
  File "/xxx/Project/Emu/Emu2/test.py", line 57, in <module>
    with torch.no_grad():
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/weights/modeling_emu.py", line 156, in generate
    prompt_image_embeds = self.model.encode_image(image, n_query=self.n_query)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/weights/modeling_emu.py", line 92, in encode_image
    image_embeds = self.visual(image)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/weights/visual.py", line 448, in forward
    features = self.forward_features(x)  # [B, n_patch, C]
  File "/root/.cache/huggingface/modules/transformers_modules/weights/visual.py", line 408, in forward_features
    x = self.patch_embed(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/weights/visual.py", line 339, in forward
    x = self.proj(x).flatten(2).transpose(1, 2)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (c10::Half) and bias type (c10::BFloat16) should be the same

Why does a Half (fp16) type show up here?

chaochen1998 commented 11 months ago

It seems to be this visual module.

This is probably because the parameters saved to disk during offload are float32 rather than bfloat16, which then causes the error at inference time. Could you try passing dtype=torch.bfloat16 when calling load_checkpoint_and_dispatch, so that all model weights are forced to bfloat16?

Following the error message, I switched the model and the image to torch.float16, and the model now runs successfully.

I still have a few questions:

  1. Does the demo on HF use torch.float32 for the model parameters?
  2. Does this code run on your machine (with torch.bfloat16 and offload enabled)?

Thanks again for your patient answers.

ryanzhangfan commented 11 months ago

Attached is the device map allocated on my side (my dev environment only has 2 GPUs, so I merged the per-GPU memory budgets from your code): (screenshot)

ryanzhangfan commented 11 months ago

It seems to be this visual module.

This is probably because the parameters saved to disk during offload are float32 rather than bfloat16, which then causes the error at inference time. Could you try passing dtype=torch.bfloat16 when calling load_checkpoint_and_dispatch, so that all model weights are forced to bfloat16?

Following the error message, I switched the model and the image to torch.float16, and the model now runs successfully.

I still have a few questions:

  1. Does the demo on HF use torch.float32 for the model parameters?
  2. Does this code run on your machine (with torch.bfloat16 and offload enabled)?

Thanks again for your patient answers.

  1. All of our training, evaluation, and demos use bf16 precision.
  2. In the HF model zoo, Emu2 and Emu2-Chat were uploaded in float32, while Emu2-Gen is in bfloat16.
  3. All native PyTorch versions of the models are in bfloat16.

chaochen1998 commented 11 months ago
  • I ran the code you posted above, and inference works fine on my side.
  • The place where your error occurs is essentially the very beginning of inference: the first line of the forward function that extracts image features. The error means the image passed in has dtype half while the model weights are bf16. Please check the data type of the image you pass in and whether it gets changed somewhere.

Attached is the device map allocated on my side (my dev environment only has 2 GPUs, so I merged the per-GPU memory budgets from your code): (screenshot)

That's strange. I convert the image's data type exactly as in the example code, passing image=inputs["image"].to(torch.bfloat16) to generate, yet I still get the Half error. Everything else is the same.

Also, is it normal that the weights on my disk are float32?

ryanzhangfan commented 11 months ago
  • I ran the code you posted above, and inference works fine on my side.
  • The place where your error occurs is essentially the very beginning of inference: the first line of the forward function that extracts image features. The error means the image passed in has dtype half while the model weights are bf16. Please check the data type of the image you pass in and whether it gets changed somewhere.

Attached is the device map allocated on my side (my dev environment only has 2 GPUs, so I merged the per-GPU memory budgets from your code): (screenshot)

That's strange. I convert the image's data type exactly as in the example code, passing image=inputs["image"].to(torch.bfloat16) to generate, yet I still get the Half error. Everything else is the same.

Also, is it normal that the weights on my disk are float32?

On my side, after passing dtype=torch.bfloat16 to load_checkpoint_and_dispatch, iterating with for n, p in model.named_parameters(): shows that every p.dtype is torch.bfloat16.
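
For reference, a quick way to check both sides (a small debugging sketch, using the variable names from your script):

# List any parameters that are not bfloat16 (should print an empty dict).
print({n: p.dtype for n, p in model.named_parameters() if p.dtype != torch.bfloat16})

# Also confirm the image tensor right before generate() is bfloat16.
print(inputs["image"].dtype)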

chaochen1998 commented 11 months ago

Would you mind pasting your code?

ryanzhangfan commented 11 months ago

from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./Emu2-Chat/")

with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "./Emu2-Chat",  # "BAAI/Emu2-Chat"
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True)

device_map = infer_auto_device_map(model, max_memory={0:'34GiB',1:'36GiB'}, no_split_module_classes=['Block','LlamaDecoderLayer'])
device_map["model.decoder.lm.lm_head"] = 0

model = load_checkpoint_and_dispatch(
    model,
    checkpoint="./Emu2-Chat/",
    device_map=device_map,
    dtype=torch.bfloat16,
    offload_folder="./tmp/offload_folder").eval()

query = "[<IMG_PLH>]describe this picture"
images = [
    Image.open("./examples/red_white_3_bottom_left.jpg").convert('RGB'),
]

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=images
)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
        length_penalty=-1,
    )

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
# ['A red circle with the number three in the middle.']

ryanzhangfan commented 11 months ago

@SANJINGSHOU14 If you're still having problems, first check whether the version of each package matches requirements.txt.

chaochen1998 commented 10 months ago

@SANJINGSHOU14 If you're still having problems, first check whether the version of each package matches requirements.txt.

Thanks a lot. I checked the versions and found my transformers version was different. After switching to 4.30.1 the problem is fixed and torch.bfloat16 runs correctly, though there is a warning:

You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565

I searched around and it seems most people just ignore this warning. I'm not sure under what circumstances it actually matters. Do I need to set legacy to False?

ryanzhangfan commented 10 months ago

We didn't set this parameter explicitly in our training and evaluation pipelines; we just use the default value. So we haven't looked into how it affects model performance in any depth.
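
If you do want to silence the warning, the tokenizer accepts a legacy flag when loading (we have not evaluated its effect on Emu2, so treat this as an untested option):

# Untested option: opt in to the non-legacy tokenizer behaviour explicitly.
tokenizer = AutoTokenizer.from_pretrained(
    "/xxx/Project/Emu/Emu2/weights",  # "BAAI/Emu2-Chat"
    legacy=False,
)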

chaochen1998 commented 10 months ago

Got it, thank you very much for your answers.

xmy0916 commented 10 months ago

@SANJINGSHOU14 (screenshot) Is this from some tool for checking GPU memory usage?

chaochen1998 commented 10 months ago

@xmy0916 That's just a comment from the authors' code in the Emu2 repo, so presumably the authors computed those numbers themselves. You can also work them out yourself, as in the sketch below.
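
For example, a rough sketch (assuming the model has already been instantiated as in the code earlier in this thread):

# Rough per-submodule parameter counts, similar to the figures in the repo comment.
for name, module in model.named_children():
    n_params = sum(p.numel() for p in module.parameters())
    print(f"{name}: {n_params / 1e9:.2f}B parameters")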