THUDM / ChatGLM-6B

ChatGLM-6B: An Open Bilingual Dialogue Language Model | 开源双语对话语言模型
Apache License 2.0

[Feature] Created a branch that supports multi-GPU deployment with automatic even allocation of GPU memory. #200

Closed ChuangLee closed 1 year ago

ChuangLee commented 1 year ago

Is your feature request related to a problem? Please describe.

https://github.com/ChuangLee/ChatGLM-6B-multiGPU — automatically splits GPU memory evenly across cards. Previously a single card needed 13 GB, so many GPUs fell just short of being able to run it, which was awkward.

Solutions

Solved at https://github.com/ChuangLee/ChatGLM-6B-multiGPU.

Additional context

No response

Ginzyl commented 1 year ago

Loading the quantized int4 model throws an error: (screenshot)

dogvane commented 1 year ago

Does inference speed improve when using multiple GPUs?

ChuangLee commented 1 year ago

Loading the quantized int4 model throws an error: (screenshot)

Isn't that because the path is wrong? Besides, if the model is already quantized to int4, do you still need multiple GPUs? I haven't tested that.

ChuangLee commented 1 year ago

Does inference speed improve when using multiple GPUs?

In theory it goes down a bit, because data has to be transferred between GPUs, but in my tests the difference is small.

twwch commented 1 year ago

How do I run the multi-GPU deployment? It does throw an error:

(screenshot)
Traceback (most recent call last):
  File "web_demo.py", line 6, in <module>
    model = load_model_on_gpus("THUDM/chatglm-6b", num_gpus=4)
  File "/data/chenhao/codes/ChatGLM-6B/chatglm_parallel.py", line 34, in load_model_on_gpus
    model = load_checkpoint_and_dispatch(
  File "/data/chenhao/anaconda3/envs/ChatGLM-6B/lib/python3.8/site-packages/accelerate/big_modeling.py", line 479, in load_checkpoint_and_dispatch
    load_checkpoint_in_model(
  File "/data/chenhao/anaconda3/envs/ChatGLM-6B/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 899, in load_checkpoint_in_model
    raise ValueError(
ValueError: `checkpoint` should be the path to a file containing a whole state dict, or the index of a sharded checkpoint, or a folder containing a sharded checkpoint, but got THUDM/chatglm-6b.
twwch commented 1 year ago

@ChuangLee Were you able to get it running?

Cherrysaber commented 1 year ago

Loading the quantized int4 model throws an error: (screenshot)

model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4-qe", trust_remote_code=True)
model.save_pretrained("./multi_gpus", max_shard_size='2GB')

First run the two lines above in Python, then run the web UI and set the model path to "./multi_gpus".
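
A rough end-to-end sketch of that workaround (for illustration only; load_model_on_gpus is the helper from the multi-GPU fork discussed in this thread):

    from transformers import AutoModel

    # Step 1: re-save the int4 checkpoint as a sharded checkpoint with an index file,
    # since load_checkpoint_and_dispatch expects a whole state dict or a sharded folder.
    model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4-qe", trust_remote_code=True)
    model.save_pretrained("./multi_gpus", max_shard_size='2GB')

    # Step 2: point the multi-GPU loader (or the web demo's model path) at the re-sharded folder.
    # model = load_model_on_gpus("./multi_gpus", num_gpus=2)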

Cherrysaber commented 1 year ago

Loading the quantized int4 model throws an error: (screenshot)

Isn't that because the path is wrong? Besides, if the model is already quantized to int4, do you still need multiple GPUs? I haven't tested that.

It's still well worth it: max_tokens is directly tied to GPU memory size, so on the same hardware the int4 model can keep far more context than the regular model.

twwch commented 1 year ago

Loading the quantized int4 model throws an error: (screenshot)

model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4-qe", trust_remote_code=True)
model.save_pretrained("./multi_gpus", max_shard_size='2GB')
First run the two lines above in Python, then run the web UI and set the model path to "./multi_gpus".

This does get it running, but a new problem came up. It really is 4 cards:

(screenshot)

Error message:

(screenshot)

Code:

(screenshot)

No response in the UI:

(screenshot)
Cherrysaber commented 1 year ago

[quotes the previous comment]

Try the following; it works fine on my two GPUs. If it doesn't, post the whole error stack.

    num_trans_layers = 28
    device_map = {'transformer.word_embeddings': 0,
                  'transformer.final_layernorm': num_gpus - 1, 'lm_head': num_gpus - 1}
    gpu_target = 0
    for index in range(num_trans_layers):
        if index % num_gpus != 0:
            gpu_target += 1
        else:
            gpu_target = 0

        device_map[f'transformer.layers.{index}'] = gpu_target
    device_map['transformer.layers.27'] = num_gpus - 1
ChuangLee commented 1 year ago

"THUDM/chatglm-6b"

Note that this path needs to be the path to your model. Here I used a relative path, with the model placed in the current folder.
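
For example (the directory name below is just a placeholder), pass a local folder that contains the model weights:

    from chatglm_parallel import load_model_on_gpus  # helper from the multi-GPU fork

    # e.g. after cloning https://huggingface.co/THUDM/chatglm-6b into ./chatglm-6b
    model = load_model_on_gpus("./chatglm-6b", num_gpus=4)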

ChuangLee commented 1 year ago

[quotes the previous exchange and Cherrysaber's device_map snippet above]

My code has been tested with 2 to 8 GPUs and works fine. Doesn't it work if you just change the parameter to num_gpus=4? Assigning GPUs with the % modulo also runs fine, but it presumably multiplies the inter-GPU communication several times over, so in theory it should be somewhat slower.
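
To make that concrete, a small illustration (not code from either branch): with 28 layers on 2 GPUs, a contiguous split crosses the GPU boundary once per forward pass, while a modulo split crosses it at nearly every layer.

    num_trans_layers, num_gpus = 28, 2

    # Contiguous blocks: layers 0-13 on GPU 0, layers 14-27 on GPU 1.
    contiguous = {i: i // (num_trans_layers // num_gpus) for i in range(num_trans_layers)}

    # Modulo / round-robin: 0, 1, 0, 1, ... across the layers.
    round_robin = {i: i % num_gpus for i in range(num_trans_layers)}

    def crossings(assignment):
        # Count how often consecutive layers sit on different devices,
        # i.e. how many inter-GPU transfers a single forward pass needs.
        return sum(assignment[i] != assignment[i + 1] for i in range(num_trans_layers - 1))

    print(crossings(contiguous), crossings(round_robin))  # 1 vs 27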

LuRenJiasWorld commented 1 year ago

I ran into the same error: Expected all tensors to be on the same device, but found at least two devices...

Neither the modulo approach nor the code in this repo runs correctly for me. The model was downloaded from https://cloud.tsinghua.edu.cn/d/fb9f16d6dc8f482596c2/.

Cherrysaber commented 1 year ago

I ran into the same error: Expected all tensors to be on the same device, but found at least two devices...

Neither the modulo approach nor the code in this repo runs correctly for me. The model was downloaded from https://cloud.tsinghua.edu.cn/d/fb9f16d6dc8f482596c2/.

Post the whole error stack.

LuRenJiasWorld commented 1 year ago

I ran into the same error: Expected all tensors to be on the same device, but found at least two devices... Neither the modulo approach nor the code in this repo runs correctly for me. The model was downloaded from https://cloud.tsinghua.edu.cn/d/fb9f16d6dc8f482596c2/.

Post the whole error stack.

(.venv) ChatGLM-6B-multiGPU | lurenjiasworld@thinkpad-neo14 [#] python web_demo.py
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:13<00:00,  1.73s/it]
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Opening in existing browser session.
Traceback (most recent call last):
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/gradio/routes.py", line 394, in run_predict
    output = await app.get_blocks().process_api(
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/gradio/blocks.py", line 1075, in process_api
    result = await self.call_function(
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/gradio/blocks.py", line 898, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/gradio/utils.py", line 549, in async_iteration
    return next(iterator)
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/web_demo.py", line 15, in predict
    for response, history in model.stream_chat(tokenizer, input, history, max_length=max_length, top_p=top_p,
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "/home/lurenjiasworld/.cache/huggingface/modules/transformers_modules/local/modeling_chatglm.py", line 1152, in stream_chat
    for outputs in self.stream_generate(**input_ids, **gen_kwargs):
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "/home/lurenjiasworld/.cache/huggingface/modules/transformers_modules/local/modeling_chatglm.py", line 1229, in stream_generate
    outputs = self(
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/lurenjiasworld/.cache/huggingface/modules/transformers_modules/local/modeling_chatglm.py", line 1031, in forward
    transformer_outputs = self.transformer(
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/lurenjiasworld/.cache/huggingface/modules/transformers_modules/local/modeling_chatglm.py", line 851, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
ChuangLee commented 1 year ago

I ran into the same error: Expected all tensors to be on the same device, but found at least two devices...

Neither the modulo approach nor the code in this repo runs correctly for me. The model was downloaded from https://cloud.tsinghua.edu.cn/d/fb9f16d6dc8f482596c2/.

Try whether the CLI demo runs correctly.

gg22mm commented 1 year ago

[quotes the issue description]

How about multiple CPUs — can that be supported? I have a 64-core CPU.

Cherrysaber commented 1 year ago

I ran into the same error: Expected all tensors to be on the same device, but found at least two devices... Neither the modulo approach nor the code in this repo runs correctly for me. The model was downloaded from https://cloud.tsinghua.edu.cn/d/fb9f16d6dc8f482596c2/.

Try whether the CLI demo runs correctly.

Everything works for me on Windows, but switching to WSL Ubuntu gives the same error as theirs. I hooked torch.embedding and found that on WSL Ubuntu the weight was placed on cuda:1 while the input was on cuda:0.
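
For reference, a minimal sketch of such a hook (illustrative only, not the exact code used): it wraps torch.embedding and prints the devices of the weight and input before delegating to the original function.

    import torch

    _orig_embedding = torch.embedding

    def _logged_embedding(weight, input, *args, **kwargs):
        # Report where the embedding table and the input ids actually live.
        print("torch.embedding: weight on", weight.device, "- input on", input.device)
        return _orig_embedding(weight, input, *args, **kwargs)

    # Install the hook before calling chat / stream_chat.
    torch.embedding = _logged_embedding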

Cherrysaber commented 1 year ago

I ran into the same error: Expected all tensors to be on the same device, but found at least two devices...

Neither the modulo approach nor the code in this repo runs correctly for me. The model was downloaded from https://cloud.tsinghua.edu.cn/d/fb9f16d6dc8f482596c2/.

Got it working — use the code below.

import os
from typing import Dict, Tuple, Union, Optional

from accelerate import load_checkpoint_and_dispatch
from torch.nn import Module
from transformers import AutoModel, AutoTokenizer
from transformers.tokenization_utils import PreTrainedTokenizer

def auto_configure_device_map(num_gpus: int) -> Dict[str, int]:
    # transformer.word_embeddings takes one slot
    # transformer.final_layernorm and lm_head together take one slot
    # transformer.layers takes 28 slots
    # 30 slots in total, distributed across num_gpus cards
    num_trans_layers = 28
    per_gpu_layers = 30 / num_gpus

    # bugfix: on Linux, the weight and input passed to torch.embedding can end up on different devices, causing a RuntimeError
    # on Windows, model.device gets set to transformer.word_embeddings.device
    # on Linux, model.device gets set to lm_head.device
    # when calling chat or stream_chat, input_ids is placed on model.device
    # if transformer.word_embeddings.device differs from model.device, a RuntimeError is raised
    # so transformer.word_embeddings, transformer.final_layernorm and lm_head are all placed on the first card
    device_map = {'transformer.word_embeddings': 0,
                  'transformer.final_layernorm': 0, 'lm_head': 0}

    used = 2
    gpu_target = 0
    for i in range(num_trans_layers):
        if used >= per_gpu_layers:
            gpu_target += 1
            used = 0
        assert gpu_target < num_gpus
        device_map[f'transformer.layers.{i}'] = gpu_target
        used += 1

    return device_map

def load_model_on_gpus(checkpoint_path: Union[str, os.PathLike], num_gpus: int = 2,
                       multi_gpu_model_cache_dir: Union[str, os.PathLike] = "./temp_model_dir",
                       tokenizer: Optional[PreTrainedTokenizer] = None, **kwargs) -> Module:
    model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs)
    model = model.eval()

    device_map = auto_configure_device_map(num_gpus)
    try:
        model = load_checkpoint_and_dispatch(
            model, checkpoint_path, device_map=device_map, offload_folder="offload", offload_state_dict=True).half()
    except ValueError:
        # index.json not found
        print(f"index.json not found, auto fixing and saving model to {multi_gpu_model_cache_dir} ...")

        assert multi_gpu_model_cache_dir is not None, "using auto fix, cache_dir must not be None"
        model.save_pretrained(multi_gpu_model_cache_dir, max_shard_size='2GB')
        model = load_checkpoint_and_dispatch(
            model, multi_gpu_model_cache_dir, device_map=device_map,
            offload_folder="offload", offload_state_dict=True).half()

        if tokenizer is not None:
            tokenizer.save_pretrained(multi_gpu_model_cache_dir)
        print(f"loading model successfully, you should use checkpoint_path={multi_gpu_model_cache_dir} next time")

    return model
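
For example, a minimal usage sketch (the chatglm_parallel module name comes from the traceback earlier in this thread; the checkpoint path and GPU count are placeholders) showing how the web demo could call this helper instead of .half().cuda():

    from transformers import AutoTokenizer
    from chatglm_parallel import load_model_on_gpus

    tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
    # Spreads the fp16 weights across two cards according to auto_configure_device_map;
    # if no index.json is found, the helper re-shards the model into ./temp_model_dir first.
    model = load_model_on_gpus("THUDM/chatglm-6b", num_gpus=2, tokenizer=tokenizer)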
xiaoweiweixiao commented 1 year ago

[quotes Cherrysaber's load_model_on_gpus code above]

@Cherrysaber I changed my code following yours, but I still get the same error. Did you change anything else?

Cherrysaber commented 1 year ago

[quotes Cherrysaber's load_model_on_gpus code above]

@Cherrysaber I changed my code following yours, but I still get the same error. Did you change anything else?

print(model.hf_device_map)
print(model.device)

Then post the full error stack.

xiaoweiweixiao commented 1 year ago

[quotes Cherrysaber's load_model_on_gpus code above]

@Cherrysaber I changed my code following yours, but I still get the same error. Did you change anything else?

print(model.hf_device_map)
print(model.device)

Then post the full error stack.

{'transformer.word_embeddings': 0, 'transformer.final_layernorm': 0, 'lm_head': 0, 'transformer.layers.0': 0, 'transformer.layers.1': 0, 'transformer.layers.2': 0, 'transformer.layers.3': 0, 'transformer.layers.4': 0, 'transformer.layers.5': 0, 'transformer.layers.6': 0, 'transformer.layers.7': 0, 'transformer.layers.8': 0, 'transformer.layers.9': 0, 'transformer.layers.10': 0, 'transformer.layers.11': 0, 'transformer.layers.12': 0, 'transformer.layers.13': 1, 'transformer.layers.14': 1, 'transformer.layers.15': 1, 'transformer.layers.16': 1, 'transformer.layers.17': 1, 'transformer.layers.18': 1, 'transformer.layers.19': 1, 'transformer.layers.20': 1, 'transformer.layers.21': 1, 'transformer.layers.22': 1, 'transformer.layers.23': 1, 'transformer.layers.24': 1, 'transformer.layers.25': 1, 'transformer.layers.26': 1, 'transformer.layers.27': 1}
cuda
/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
  warnings.warn(
  0%|          | 0/50000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "finetune.py", line 169, in <module>
    main()
  File "finetune.py", line 160, in main
    trainer.train()
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/transformers/trainer.py", line 2645, in training_step
    loss = self.compute_loss(model, inputs)
  File "finetune.py", line 103, in compute_loss
    return model(
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.

Original Traceback (most recent call last):
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/peft/peft_model.py", line 529, in forward
    return self.base_model(
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/xxx/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1031, in forward
    transformer_outputs = self.transformer(
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xxx/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 876, in forward
    layer_ret = layer(
  File "/opt/cxxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/xxx/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 574, in forward
    attention_input = self.input_layernorm(hidden_states)
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/modules/normalization.py", line 190, in forward
    return F.layer_norm(
  File "/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/functional.py", line 2516, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)

@Cherrysaber This is the model info and the error output. Could you help me figure out where the problem is?

Cherrysaber commented 1 year ago

@xiaoweiweixiao This doesn't work for fine-tuning. Fine-tune on a single card first, then deploy onto multiple cards. Or write your own training code and move the tensors onto the same device before each operation.
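
A rough sketch of what "move the tensors onto the same device before each operation" can look like (illustrative only, not code from this repo):

    import torch

    def call_on_weight_device(module: torch.nn.Module, hidden_states: torch.Tensor) -> torch.Tensor:
        # Move the activations to whichever GPU holds this module's weights before
        # calling it, so ops like torch.layer_norm never see cuda:0 and cuda:1 mixed.
        weight_device = next(module.parameters()).device
        return module(hidden_states.to(weight_device))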

xiaoweiweixiao commented 1 year ago

@xiaoweiweixiao This doesn't work for fine-tuning. Fine-tune on a single card first, then deploy onto multiple cards. Or write your own training code and move the tensors onto the same device before each operation.

Oh, so this is only for deployment. Does "moving the tensors onto the same device before each operation" lower the per-card memory requirement? Would you consider releasing distributed training code? O.O

huangjiaheng commented 1 year ago

[quotes the issue description]

Doesn't it already support multi-GPU deployment?

ChuangLee commented 1 year ago

Doesn't it already support multi-GPU deployment?

The original code loads the whole model onto one card. With two 12 GB cards, deployment would OOM.

yuanzhoulvpi2017 commented 1 year ago

I added single-machine multi-GPU training code; the link is here: https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/Chatglm6b_ModelParallel

cywjava commented 1 year ago

I ran into the same error: Expected all tensors to be on the same device, but found at least two devices...

Training with either 2 or 4 cards works.

I've solved it. In the screenshot, GPUs 0 and 1 are used for training and GPU 7 for text generation. (screenshot)

Training with 4 cards also works fine. (screenshot)

Direct link ===> https://github.com/chenyiwan/chatglm-6b-fine-tuning

cywjava commented 1 year ago

@xiaoweiweixiao This doesn't work for fine-tuning. Fine-tune on a single card first, then deploy onto multiple cards. Or write your own training code and move the tensors onto the same device before each operation.

Oh, so this is only for deployment. Does "moving the tensors onto the same device before each operation" lower the per-card memory requirement? Would you consider releasing distributed training code? O.O

No, training can use multiple cards too.

iceqing commented 1 year ago

When training with multiple cards I get the error: transformer.prefix_encoder.embedding.weight doesn't have any device set. Has anyone else run into this? The error stack is below.

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:52<00:00,  6.56s/it]
[INFO|modeling_utils.py:3032] 2023-04-16 14:40:32,653 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.

[WARNING|modeling_utils.py:3034] 2023-04-16 14:40:32,653 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at THUDM/chatglm-6b and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:2690] 2023-04-16 14:40:33,344 >> Generation config file not found, using a generation config created from the model config.
index.json not found, auto fixing and saving model to ./temp_model_dir ...
[INFO|configuration_utils.py:457] 2023-04-16 14:40:35,243 >> Configuration saved in ./temp_model_dir/config.json
[INFO|configuration_utils.py:362] 2023-04-16 14:40:35,250 >> Configuration saved in ./temp_model_dir/generation_config.json
[INFO|modeling_utils.py:1770] 2023-04-16 14:41:20,236 >> The model is bigger than the maximum size per checkpoint (2GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at ./temp_model_dir/pytorch_model.bin.index.json.
Traceback (most recent call last):
  File "/home/ice/ChatGLM/ChatGLM-6B/ptuning/chatglm_parallel.py", line 50, in load_model_on_gpus
    model = load_checkpoint_and_dispatch(
  File "/home/ice/ChatGLM/ChatGLM-6B/venv/lib/python3.10/site-packages/accelerate/big_modeling.py", line 479, in load_checkpoint_and_dispatch
    load_checkpoint_in_model(
  File "/home/ice/ChatGLM/ChatGLM-6B/venv/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 899, in load_checkpoint_in_model
    raise ValueError(
ValueError: `checkpoint` should be the path to a file containing a whole state dict, or the index of a sharded checkpoint, or a folder containing a sharded checkpoint, but got THUDM/chatglm-6b.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ice/ChatGLM/ChatGLM-6B/ptuning/main.py", line 432, in <module>
    main()
  File "/home/ice/ChatGLM/ChatGLM-6B/ptuning/main.py", line 127, in main
    model = load_model_on_gpus(model_args.model_name_or_path,config=config, num_gpus=2)
  File "/home/ice/ChatGLM/ChatGLM-6B/ptuning/chatglm_parallel.py", line 58, in load_model_on_gpus
    model = load_checkpoint_and_dispatch(
  File "/home/ice/ChatGLM/ChatGLM-6B/venv/lib/python3.10/site-packages/accelerate/big_modeling.py", line 479, in load_checkpoint_and_dispatch
    load_checkpoint_in_model(
  File "/home/ice/ChatGLM/ChatGLM-6B/venv/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 935, in load_checkpoint_in_model
    raise ValueError(f"{param_name} doesn't have any device set.")
ValueError: transformer.prefix_encoder.embedding.weight doesn't have any device set.
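
A likely cause (an assumption, not confirmed in this thread): the p-tuning config adds a transformer.prefix_encoder module that auto_configure_device_map never assigns to any GPU, so accelerate has no device for its newly initialized weights. A hedged fix is to give it an entry in the device map as well, e.g.:

    from typing import Dict

    def add_prefix_encoder_to_device_map(device_map: Dict[str, int], gpu: int = 0) -> Dict[str, int]:
        # P-tuning v2 (pre_seq_len set in the config) introduces transformer.prefix_encoder;
        # pin it to the first card, mirroring word_embeddings / final_layernorm / lm_head. (Assumption.)
        device_map['transformer.prefix_encoder'] = gpu
        return device_map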
duzx16 commented 1 year ago

This has been merged into the main branch. Thank you very much.

ZhaoyingAC commented 1 year ago

[quotes Cherrysaber's load_model_on_gpus code and the follow-up exchange above]

Using the official multi-GPU deployment, I hit a similar problem at predict time: return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled) RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)

I opened an issue: https://github.com/THUDM/ChatGLM-6B/issues/929

zhupeijie commented 1 year ago

[quotes cywjava's comment above]

May I ask: with your multi-GPU training, does the first card still have to have enough memory to hold the whole model? Or is the model split evenly across the cards, as in the official multi-GPU deployment?

StanleyOf427 commented 8 months ago

[quotes Cherrysaber's load_model_on_gpus code above]

My environment is Python 3.8, PyTorch 2.1.2+cu118, NVCC CUDA 12.0, driver 546.17, CUDA 12.3, with two P102-100 cards. I hit the same "Expected all tensors to be on the same device, but found at least two devices..." error, and the method above solved it. Just paste this function in to replace the corresponding function in the original file.

zhupeijie commented 8 months ago

Hello, your email has been received. I will log in and check it soon. Have a great day!