baaivision / Emu

Emu Series: Generative Multimodal Models from BAAI
https://baaivision.github.io/emu2/
Apache License 2.0

GPU memory usage does not match the calculated size #71

Closed chaochen1998 closed 10 months ago

chaochen1998 commented 11 months ago

Hi, when running PyTorch inference with the weights from https://model.baai.ac.cn/model-detail/220122/Emu2-Chat_pytorch_model.bf16.pth, I found that the size of each layer does not match what I calculated:

    emu_model.visual:                           4B
    emu_model.decoder.lm.project_down:        omit
    emu_model.decoder.lm.project_up:          omit
    emu_model.decoder.lm.model.embed_tokens:  omit
    emu_model.decoder.lm.model.norm:          omit
    emu_model.decoder.lm.lm_head:             omit
    emu_model.decoder.lm.model.layers.[0..59]: 33B (0.55B/layer)

This is the comment from the code. According to the last line, each layer has 535,049,216 parameters. With bf16 (2 bytes per parameter), each layer should take about 0.997 GiB of GPU memory. But when I load the model onto the GPU and watch memory usage in real time, each layer takes a little over 2 GB, so a 22.2 GB GPU can only hold 10 layers. Does this mean the data stored in the ckpt is loaded as 4 bytes per value? If so, a machine that cannot fit the float32 weights also cannot run bf16, is that right?
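
For reference, the quick arithmetic behind my numbers (using the 535,049,216 per-layer figure from the comment above):

params_per_layer = 535_049_216          # from the code comment above

print(params_per_layer * 2 / 1024**3)   # bf16, 2 bytes/param -> ~0.997 GiB per layer
print(params_per_layer * 4 / 1024**3)   # fp32, 4 bytes/param -> ~1.99 GiB, matches what I see on the GPU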

ryanzhangfan commented 11 months ago

With the default initialization and weight-loading path, the model is first instantiated as an fp32 version and then converted to bf16. The fp32 version needs roughly 140 GB of RAM or GPU memory; the bf16 version needs about 71 GB.

If you don't want to first materialize an fp32 copy that actually occupies RAM/GPU memory, you can combine accelerate's init_empty_weights and load_checkpoint_and_dispatch to initialize and load the model. This reduces memory/GPU-memory usage; that way you need a bit over 70 GB of RAM/GPU memory in total.

chaochen1998 commented 11 months ago

Got it, thank you very much.

One more question. When I run the HF version of the Emu2-Chat model with the code from this repo on a single machine with four 24 GB GPUs, using offload, and pass in an image converted to bfloat16, I get the following error:

WARNING:root:Some parameters are on the meta device device because they were offloaded to the disk.
time cost of loading model: 172.20009589195251 seconds
Traceback (most recent call last):
  File "/xxx/Project/Emu/Emu2/test.py", line 57, in <module>
    outputs = model.generate(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/weights/modeling_emu.py", line 159, in generate
    prompt_image_embeds = self.project_up(prompt_image_embeds)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != float

This means some model parameters are not bfloat16. So I converted the image to float instead, and then hit the opposite case elsewhere, where the parameters are bfloat16:

Traceback (most recent call last):
  File "/xxx/Project/Emu/Emu2/test.py", line 57, in <module>
    outputs = model.generate(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/weights/modeling_emu.py", line 156, in generate
    prompt_image_embeds = self.model.encode_image(image, n_query=self.n_query)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/weights/modeling_emu.py", line 92, in encode_image
    image_embeds = self.visual(image)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/weights/visual.py", line 448, in forward
    features = self.forward_features(x)  # [B, n_patch, C]
  File "/root/.cache/huggingface/modules/transformers_modules/weights/visual.py", line 408, in forward_features
    x = self.patch_embed(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/weights/visual.py", line 339, in forward
    x = self.proj(x).flatten(2).transpose(1, 2)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same

The code I'm using:

from PIL import Image 
import requests
import torch 
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
import time
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:70"
p_time = time.time()
tokenizer = AutoTokenizer.from_pretrained("/xxx/Project/Emu/Emu2/weights") # "BAAI/Emu2-Chat"
c_time = time.time()
print(f"time cost of loading tokenizer: {c_time-p_time} seconds")

p_time = time.time()
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "/xxx/Project/Emu/Emu2/weights",  # "BAAI/Emu2-Chat"
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True)
c_time = time.time()
print(f"time cost of loading model (no gpu): {c_time-p_time} seconds")

device_map = infer_auto_device_map(model, max_memory={0:'16GiB',1:'18GiB',2:'18GiB',3:'18GiB'}, no_split_module_classes=['Block','LlamaDecoderLayer'])

# input and output logits should be on same device
device_map["model.decoder.lm.lm_head"] = 0

p_time = time.time()
model = load_checkpoint_and_dispatch(
    model,
    checkpoint="/xxx/Project/Emu/Emu2/weights",
    device_map=device_map,
    offload_folder="/xxx/Project/Emu/Emu2/offload_folder").eval()
c_time = time.time()
print(f"time cost of loading model: {c_time-p_time} seconds")

# `[<IMG_PLH>]` is the image placeholder which will be replaced by image embeddings. 
# the number of `[<IMG_PLH>]` should be equal to the number of input images
query = "[<IMG_PLH>][descripe this picture]"

images = [
    Image.open("./examples/red_white_3_bottom_left.jpg").convert('RGB'),
]

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=images
)

p_time = time.time()
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
        length_penalty=-1)
c_time = time.time()
print(f"time cost of generating answer: {c_time-p_time} seconds")
output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(output_text)

How should I handle this? Thanks again for your help.

ryanzhangfan commented 11 months ago

This is probably because the parameters saved to disk during offload are float32 rather than bfloat16, which then causes the error at inference time. Could you try passing dtype=torch.bfloat16 when calling load_checkpoint_and_dispatch, so that all model weights are forced to bfloat16?
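
Based on the snippet you posted, that would just mean adding one extra argument, something like:

model = load_checkpoint_and_dispatch(
    model,
    checkpoint="/xxx/Project/Emu/Emu2/weights",
    device_map=device_map,
    dtype=torch.bfloat16,   # cast all loaded (and offloaded) weights to bfloat16
    offload_folder="/xxx/Project/Emu/Emu2/offload_folder").eval()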

chaochen1998 commented 11 months ago

This is probably because the parameters saved to disk during offload are float32 rather than bfloat16, which then causes the error at inference time. Could you try passing dtype=torch.bfloat16 when calling load_checkpoint_and_dispatch, so that all model weights are forced to bfloat16?

I tried setting dtype=torch.bfloat16, and a new problem appeared:

Traceback (most recent call last):
  File "/xxx/Project/Emu/Emu2/test.py", line 57, in <module>
    with torch.no_grad():
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/weights/modeling_emu.py", line 156, in generate
    prompt_image_embeds = self.model.encode_image(image, n_query=self.n_query)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/weights/modeling_emu.py", line 92, in encode_image
    image_embeds = self.visual(image)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/weights/visual.py", line 448, in forward
    features = self.forward_features(x)  # [B, n_patch, C]
  File "/root/.cache/huggingface/modules/transformers_modules/weights/visual.py", line 408, in forward_features
    x = self.patch_embed(x)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/weights/visual.py", line 339, in forward
    x = self.proj(x).flatten(2).transpose(1, 2)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (c10::Half) and bias type (c10::BFloat16) should be the same

Why does a Half (fp16) type show up here?

chaochen1998 commented 11 months ago

It seems to be this visual module.

This is probably because the parameters saved to disk during offload are float32 rather than bfloat16, which then causes the error at inference time. Could you try passing dtype=torch.bfloat16 when calling load_checkpoint_and_dispatch, so that all model weights are forced to bfloat16?

Following the error message, I switched the model and the image to torch.float16, and the model now runs successfully.

I still have a few questions:

  1. Does the demo on HF use torch.float32 for the model parameters?
  2. Does this code run on your machine (with torch.bfloat16 and offload enabled)?

Thanks again for your patient answers.

ryanzhangfan commented 11 months ago

Attached is the device map allocated on my side (my dev environment only has 2 GPUs, so I merged the per-GPU memory budgets from your code): (screenshot)

ryanzhangfan commented 11 months ago

It seems to be this visual module.

This is probably because the parameters saved to disk during offload are float32 rather than bfloat16, which then causes the error at inference time. Could you try passing dtype=torch.bfloat16 when calling load_checkpoint_and_dispatch, so that all model weights are forced to bfloat16?

Following the error message, I switched the model and the image to torch.float16, and the model now runs successfully.

I still have a few questions:

  1. Does the demo on HF use torch.float32 for the model parameters?
  2. Does this code run on your machine (with torch.bfloat16 and offload enabled)?

Thanks again for your patient answers.

  1. All of our training, evaluation, and demos use bf16 precision.
  2. In the HF model zoo, Emu2 and Emu2-Chat were uploaded in float32, while Emu2-Gen is in bfloat16.
  3. All native PyTorch versions of the models are in bfloat16.

chaochen1998 commented 11 months ago
  • I ran the code you posted above, and inference works fine on my side.
  • The place where your error occurs is essentially the very beginning of inference: the first line of the forward function that extracts image features. The error means the image passed in has dtype half while the model weights are bf16. Please check the data type of the image you pass in and whether it gets changed somewhere.

Attached is the device map allocated on my side (my dev environment only has 2 GPUs, so I merged the per-GPU memory budgets from your code): (screenshot)

That's strange. I convert the image's data type exactly as in the example code, passing image=inputs["image"].to(torch.bfloat16) to generate, yet I still get the Half error. Everything else is the same.

Also, is it normal that the weights on my disk are float32?

ryanzhangfan commented 11 months ago
  • I ran the code you posted above, and inference works fine on my side.
  • The place where your error occurs is essentially the very beginning of inference: the first line of the forward function that extracts image features. The error means the image passed in has dtype half while the model weights are bf16. Please check the data type of the image you pass in and whether it gets changed somewhere.

Attached is the device map allocated on my side (my dev environment only has 2 GPUs, so I merged the per-GPU memory budgets from your code): (screenshot)

That's strange. I convert the image's data type exactly as in the example code, passing image=inputs["image"].to(torch.bfloat16) to generate, yet I still get the Half error. Everything else is the same.

Also, is it normal that the weights on my disk are float32?

On my side, after passing dtype=torch.bfloat16 to load_checkpoint_and_dispatch, iterating with for n, p in model.named_parameters(): shows that every p.dtype is torch.bfloat16.
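
For reference, a quick way to check both sides (a small debugging sketch, using the variable names from your script):

# List any parameters that are not bfloat16 (should print an empty dict).
print({n: p.dtype for n, p in model.named_parameters() if p.dtype != torch.bfloat16})

# Also confirm the image tensor right before generate() is bfloat16.
print(inputs["image"].dtype)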

chaochen1998 commented 11 months ago

Would you mind pasting your code?

ryanzhangfan commented 11 months ago

from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./Emu2-Chat/")

with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        "./Emu2-Chat",  # "BAAI/Emu2-Chat"
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True)

device_map = infer_auto_device_map(model, max_memory={0:'34GiB',1:'36GiB'}, no_split_module_classes=['Block','LlamaDecoderLayer'])
device_map["model.decoder.lm.lm_head"] = 0

model = load_checkpoint_and_dispatch(
    model,
    checkpoint="./Emu2-Chat/",
    device_map=device_map,
    dtype=torch.bfloat16,
    offload_folder="./tmp/offload_folder").eval()

query = "[<IMG_PLH>]describe this picture"
images = [
    Image.open("./examples/red_white_3_bottom_left.jpg").convert('RGB'),
]

inputs = model.build_input_ids(
    text=[query],
    tokenizer=tokenizer,
    image=images
)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        image=inputs["image"].to(torch.bfloat16),
        max_new_tokens=64,
        length_penalty=-1,
    )

output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
# ['A red circle with the number three in the middle.']

ryanzhangfan commented 11 months ago

@SANJINGSHOU14 If you're still having problems, first check whether the version of each package matches requirements.txt.

chaochen1998 commented 10 months ago

@SANJINGSHOU14 If you're still having problems, first check whether the version of each package matches requirements.txt.

Thanks a lot. I checked the versions and found my transformers version was different. After switching to 4.30.1 the problem is fixed and torch.bfloat16 runs correctly, though there is a warning:

You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565

I searched around and it seems most people just ignore this warning. I'm not sure under what circumstances it actually matters. Do I need to set legacy to False?

ryanzhangfan commented 10 months ago

We didn't set this parameter explicitly in our training and evaluation pipelines; we just use the default value. So we haven't looked into how it affects model performance in any depth.
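
If you do want to silence the warning, the tokenizer accepts a legacy flag when loading (we have not evaluated its effect on Emu2, so treat this as an untested option):

# Untested option: opt in to the non-legacy tokenizer behaviour explicitly.
tokenizer = AutoTokenizer.from_pretrained(
    "/xxx/Project/Emu/Emu2/weights",  # "BAAI/Emu2-Chat"
    legacy=False,
)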

chaochen1998 commented 10 months ago

Got it, thank you very much for your answers.

xmy0916 commented 10 months ago

@SANJINGSHOU14 (screenshot) Is this from some tool for checking GPU memory usage?

chaochen1998 commented 10 months ago

@xmy0916 That's just a comment from the authors' code in the Emu2 repo, so presumably the authors computed those numbers themselves. You can also work them out yourself, as in the sketch below.
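
For example, a rough sketch (assuming the model has already been instantiated as in the code earlier in this thread):

# Rough per-submodule parameter counts, similar to the figures in the repo comment.
for name, module in model.named_children():
    n_params = sum(p.numel() for p in module.parameters())
    print(f"{name}: {n_params / 1e9:.2f}B parameters")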