Closed ChuangLee closed 1 year ago
Loading the quantized int4 model throws an error:
Does inference speed improve with multiple GPUs?
Loading the quantized int4 model throws an error:
Isn't that just a wrong path? Besides, once you've quantized to int4, do you still need multiple GPUs? I haven't tested that.
Does inference speed improve with multiple GPUs?
In theory it drops, since data has to be transferred between GPUs, but in my tests the difference is minor.
How do I run it with multi-GPU deployment? It does indeed throw an error:
Traceback (most recent call last):
  File "web_demo.py", line 6, in <module>
    model = load_model_on_gpus("THUDM/chatglm-6b", num_gpus=4)
  File "/data/chenhao/codes/ChatGLM-6B/chatglm_parallel.py", line 34, in load_model_on_gpus
    model = load_checkpoint_and_dispatch(
  File "/data/chenhao/anaconda3/envs/ChatGLM-6B/lib/python3.8/site-packages/accelerate/big_modeling.py", line 479, in load_checkpoint_and_dispatch
    load_checkpoint_in_model(
  File "/data/chenhao/anaconda3/envs/ChatGLM-6B/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 899, in load_checkpoint_in_model
    raise ValueError(
ValueError: `checkpoint` should be the path to a file containing a whole state dict, or the index of a sharded checkpoint, or a folder containing a sharded checkpoint, but got THUDM/chatglm-6b.
@ChuangLee were you able to run it successfully?
Loading the quantized int4 model throws an error:
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4-qe", trust_remote_code=True)
model.save_pretrained("./multi_gpus", max_shard_size='2GB')
First run the two lines above in Python, then run the web UI with the model path set to _"./multi_gpus"_.
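Putting the workaround together, a minimal sketch (load_model_on_gpus is the helper from chatglm_parallel.py seen in the traceback above; num_gpus=4 is illustrative):

from transformers import AutoModel
from chatglm_parallel import load_model_on_gpus  # helper from this thread

# Re-save the int4 checkpoint as sharded files so accelerate can find an index
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4-qe", trust_remote_code=True)
model.save_pretrained("./multi_gpus", max_shard_size='2GB')

# Then load the re-saved copy across the GPUs and point the web UI at it
model = load_model_on_gpus("./multi_gpus", num_gpus=4)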
Multiple cards are still very necessary even with int4: max_tokens is directly tied to VRAM size, so on the same hardware the int4 model can hold far more context than the regular model.
That workaround does get it running, but a new problem has appeared. And yes, it is 4 cards.
[screenshot: error message]
[screenshot: code]
[screenshot: no response in the UI]
Try the code below — it works fine on my two GPUs. If it doesn't, paste the whole error stack.
num_gpus = 2  # added for completeness: set to the number of GPUs you use
num_trans_layers = 28
device_map = {'transformer.word_embeddings': 0,
              'transformer.final_layernorm': num_gpus - 1, 'lm_head': num_gpus - 1}
gpu_target = 0
for index in range(num_trans_layers):
    if index % num_gpus != 0:
        gpu_target += 1
    else:
        gpu_target = 0
    device_map[f'transformer.layers.{index}'] = gpu_target
device_map['transformer.layers.27'] = num_gpus - 1
"THUDM/chatglm-6b"
Note: this path needs to be the path to your model. Mine is a relative path, with the model placed in the current folder.
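For context, a sketch (mine, not from the comment above) of how such a hand-built device_map is typically handed to accelerate; checkpoint_path must point at a sharded checkpoint:

from accelerate import load_checkpoint_and_dispatch
from transformers import AutoModel

checkpoint_path = "THUDM/chatglm-6b"  # replace with your local model directory
model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True).half().eval()
# dispatch submodules to the GPUs chosen in the device_map above
model = load_checkpoint_and_dispatch(model, checkpoint_path, device_map=device_map)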
My code has been tested with anywhere from 2 to 8 GPUs and works fine — did you try just changing the parameter to num_gpus=4? Your modulo-based GPU assignment runs without problems, but it seems to multiply the inter-GPU communication several times over, so in theory it will be somewhat slower.
I ran into the same error: Expected all tensors to be on the same device, but found at least two devices...
Neither the modulo approach nor the code in this repo runs correctly. The model was downloaded from https://cloud.tsinghua.edu.cn/d/fb9f16d6dc8f482596c2/.
Paste the entire error stack.
(.venv) ChatGLM-6B-multiGPU | lurenjiasworld@thinkpad-neo14 [#] python web_demo.py
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:13<00:00, 1.73s/it]
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Opening in existing browser session.
Traceback (most recent call last):
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/gradio/routes.py", line 394, in run_predict
    output = await app.get_blocks().process_api(
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/gradio/blocks.py", line 1075, in process_api
    result = await self.call_function(
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/gradio/blocks.py", line 898, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/gradio/utils.py", line 549, in async_iteration
    return next(iterator)
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/web_demo.py", line 15, in predict
    for response, history in model.stream_chat(tokenizer, input, history, max_length=max_length, top_p=top_p,
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "/home/lurenjiasworld/.cache/huggingface/modules/transformers_modules/local/modeling_chatglm.py", line 1152, in stream_chat
    for outputs in self.stream_generate(**input_ids, **gen_kwargs):
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "/home/lurenjiasworld/.cache/huggingface/modules/transformers_modules/local/modeling_chatglm.py", line 1229, in stream_generate
    outputs = self(
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/lurenjiasworld/.cache/huggingface/modules/transformers_modules/local/modeling_chatglm.py", line 1031, in forward
    transformer_outputs = self.transformer(
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/lurenjiasworld/.cache/huggingface/modules/transformers_modules/local/modeling_chatglm.py", line 851, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/mnt/data/Workspace/ChatGLM/ChatGLM-6B-multiGPU/.venv/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
Try whether the CLI demo runs normally.
Is your feature request related to a problem? Please describe.
https://github.com/ChuangLee/ChatGLM-6B-multiGPU spreads the VRAM load evenly across GPUs automatically. A single card used to need 13GB, which many GPUs fall just short of — awkward.
Solutions
Solved by https://github.com/ChuangLee/ChatGLM-6B-multiGPU.
Additional context
No response
Boss, what about multiple CPUs — how can those be supported? Mine is a 64-core CPU.
Everything works for me on Windows, but switching to WSL Ubuntu gives the same error they got. I hooked torch.embedding and found that on WSL Ubuntu the weight gets placed on cuda:1 while the input is on cuda:0.
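For reference, a minimal sketch of such a hook (the wrapper name is mine) that prints the devices torch.embedding receives:

import torch

_orig_embedding = torch.embedding

def _traced_embedding(weight, input, *args, **kwargs):
    # log both devices so a cuda:0 / cuda:1 mismatch is visible before the RuntimeError
    print(f"torch.embedding: weight on {weight.device}, input on {input.device}")
    return _orig_embedding(weight, input, *args, **kwargs)

torch.embedding = _traced_embedding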
Sorted it out — use the code below:
import os
from typing import Dict, Tuple, Union, Optional

from accelerate import load_checkpoint_and_dispatch
from torch.nn import Module
from transformers import AutoModel, AutoTokenizer
from transformers.tokenization_utils import PreTrainedTokenizer


def auto_configure_device_map(num_gpus: int) -> Dict[str, int]:
    # transformer.word_embeddings counts as 1 layer
    # transformer.final_layernorm and lm_head together count as 1 layer
    # transformer.layers has 28 layers
    # 30 layers in total, distributed across num_gpus cards
    num_trans_layers = 28
    per_gpu_layers = 30 / num_gpus

    # bugfix: on Linux, the weight and input passed to torch.embedding end up on
    # different devices, causing a RuntimeError
    # on Windows, model.device is set to transformer.word_embeddings.device
    # on Linux, model.device is set to lm_head.device
    # when calling chat or stream_chat, input_ids is placed on model.device
    # if transformer.word_embeddings.device differs from model.device, a RuntimeError follows
    # so transformer.word_embeddings, transformer.final_layernorm and lm_head
    # are all pinned to the first card here
    device_map = {'transformer.word_embeddings': 0,
                  'transformer.final_layernorm': 0, 'lm_head': 0}
    used = 2
    gpu_target = 0
    for i in range(num_trans_layers):
        if used >= per_gpu_layers:
            gpu_target += 1
            used = 0
        assert gpu_target < num_gpus
        device_map[f'transformer.layers.{i}'] = gpu_target
        used += 1

    return device_map


def load_model_on_gpus(checkpoint_path: Union[str, os.PathLike], num_gpus: int = 2,
                       multi_gpu_model_cache_dir: Union[str, os.PathLike] = "./temp_model_dir",
                       tokenizer: Optional[PreTrainedTokenizer] = None, **kwargs) -> Module:
    model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs)
    model = model.eval()

    device_map = auto_configure_device_map(num_gpus)
    try:
        model = load_checkpoint_and_dispatch(
            model, checkpoint_path, device_map=device_map, offload_folder="offload", offload_state_dict=True).half()
    except ValueError:
        # index.json not found
        print(f"index.json not found, auto fixing and saving model to {multi_gpu_model_cache_dir} ...")

        assert multi_gpu_model_cache_dir is not None, "using auto fix, cache_dir must not be None"
        model.save_pretrained(multi_gpu_model_cache_dir, max_shard_size='2GB')
        model = load_checkpoint_and_dispatch(
            model, multi_gpu_model_cache_dir, device_map=device_map,
            offload_folder="offload", offload_state_dict=True).half()

        if tokenizer is not None:
            tokenizer.save_pretrained(multi_gpu_model_cache_dir)
        print(f"loading model successfully, you should use checkpoint_path={multi_gpu_model_cache_dir} next time")

    return model
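A minimal usage sketch (paths and num_gpus are illustrative), for example in place of the stock model-loading line in web_demo.py:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
# the first run falls back to the auto-fix path and re-saves shards to ./temp_model_dir
model = load_model_on_gpus("THUDM/chatglm-6b", num_gpus=2, tokenizer=tokenizer)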
@Cherrysaber I changed things following your code and still get the same error. Did you change anything else?
print(model.hf_device_map)
print(model.device)
Then paste the complete error stack.
{'transformer.word_embeddings': 0, 'transformer.final_layernorm': 0, 'lm_head': 0, 'transformer.layers.0': 0, 'transformer.layers.1': 0, 'transformer.layers.2': 0, 'transformer.layers.3': 0, 'transformer.layers.4': 0, 'transformer.layers.5': 0, 'transformer.layers.6': 0, 'transformer.layers.7': 0, 'transformer.layers.8': 0, 'transformer.layers.9': 0, 'transformer.layers.10': 0, 'transformer.layers.11': 0, 'transformer.layers.12': 0, 'transformer.layers.13': 1, 'transformer.layers.14': 1, 'transformer.layers.15': 1, 'transformer.layers.16': 1, 'transformer.layers.17': 1, 'transformer.layers.18': 1, 'transformer.layers.19': 1, 'transformer.layers.20': 1, 'transformer.layers.21': 1, 'transformer.layers.22': 1, 'transformer.layers.23': 1, 'transformer.layers.24': 1, 'transformer.layers.25': 1, 'transformer.layers.26': 1, 'transformer.layers.27': 1}
cuda
/opt/xxx/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True
to disable this warning
warnings.warn(
0%| | 0/50000 [00:00<?, ?it/s]Traceback (most recent call last):
File "finetune.py", line 169, in
@Cherrysaber here are the model info and the reported error — could you take a look at where the problem is?
@xiaoweiweixiao fine-tuning won't work this way. Fine-tune on a single card first, then deploy to multiple cards. Or write your own training code and, before every tensor operation in between, move the tensors onto the same device first.
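A generic illustration of that last point (my own sketch, not code from this repo) — align devices before the op:

import torch

def safe_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # move b onto a's device first, avoiding
    # "Expected all tensors to be on the same device"
    if a.device != b.device:
        b = b.to(a.device)
    return a @ b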
Oh, so this can only be used for deployment. Does "moving tensors onto the same device before operating on them" reduce the per-card VRAM requirement? Boss, would you consider putting out distributed-training code? O.O
The original project already supports multi-GPU deployment, though.
The original code loads the entire model onto one card though, right? With two 12GB cards, deployment will OOM.
Training works with either two or four cards.
I've solved it — the screenshot shows GPUs 0 and 1 used for training and GPU 7 for text generation.
Training on four cards also works without problems.
No — training can use multiple cards too.
When training on multiple cards I got this error: transformer.prefix_encoder.embedding.weight doesn't have any device set. Has anyone run into this? The error stack is below:
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:52<00:00, 6.56s/it]
[INFO|modeling_utils.py:3032] 2023-04-16 14:40:32,653 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.
[WARNING|modeling_utils.py:3034] 2023-04-16 14:40:32,653 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at THUDM/chatglm-6b and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:2690] 2023-04-16 14:40:33,344 >> Generation config file not found, using a generation config created from the model config.
index.json not found, auto fixing and saving model to ./temp_model_dir ...
[INFO|configuration_utils.py:457] 2023-04-16 14:40:35,243 >> Configuration saved in ./temp_model_dir/config.json
[INFO|configuration_utils.py:362] 2023-04-16 14:40:35,250 >> Configuration saved in ./temp_model_dir/generation_config.json
[INFO|modeling_utils.py:1770] 2023-04-16 14:41:20,236 >> The model is bigger than the maximum size per checkpoint (2GB) and is going to be split in 7 checkpoint shards. You can find where each parameters has been saved in the index located at ./temp_model_dir/pytorch_model.bin.index.json.
Traceback (most recent call last):
  File "/home/ice/ChatGLM/ChatGLM-6B/ptuning/chatglm_parallel.py", line 50, in load_model_on_gpus
    model = load_checkpoint_and_dispatch(
  File "/home/ice/ChatGLM/ChatGLM-6B/venv/lib/python3.10/site-packages/accelerate/big_modeling.py", line 479, in load_checkpoint_and_dispatch
    load_checkpoint_in_model(
  File "/home/ice/ChatGLM/ChatGLM-6B/venv/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 899, in load_checkpoint_in_model
    raise ValueError(
ValueError: `checkpoint` should be the path to a file containing a whole state dict, or the index of a sharded checkpoint, or a folder containing a sharded checkpoint, but got THUDM/chatglm-6b.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ice/ChatGLM/ChatGLM-6B/ptuning/main.py", line 432, in <module>
    main()
  File "/home/ice/ChatGLM/ChatGLM-6B/ptuning/main.py", line 127, in main
    model = load_model_on_gpus(model_args.model_name_or_path, config=config, num_gpus=2)
  File "/home/ice/ChatGLM/ChatGLM-6B/ptuning/chatglm_parallel.py", line 58, in load_model_on_gpus
    model = load_checkpoint_and_dispatch(
  File "/home/ice/ChatGLM/ChatGLM-6B/venv/lib/python3.10/site-packages/accelerate/big_modeling.py", line 479, in load_checkpoint_and_dispatch
    load_checkpoint_in_model(
  File "/home/ice/ChatGLM/ChatGLM-6B/venv/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 935, in load_checkpoint_in_model
    raise ValueError(f"{param_name} doesn't have any device set.")
ValueError: transformer.prefix_encoder.embedding.weight doesn't have any device set.
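The ValueError says accelerate found a parameter that the manual device_map never assigns: P-tuning adds a transformer.prefix_encoder module that auto_configure_device_map above doesn't know about. A plausible, untested sketch of a fix is to pin it explicitly:

def auto_configure_device_map_for_ptuning(num_gpus: int) -> Dict[str, int]:
    device_map = auto_configure_device_map(num_gpus)
    # give the P-tuning prefix encoder an explicit device so
    # load_checkpoint_in_model doesn't raise "doesn't have any device set"
    device_map['transformer.prefix_encoder'] = 0
    return device_map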
Merged into the main branch now. Many thanks.
Boss, with the official multi-GPU deployment I also hit a similar problem at predict time: return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled) RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)
May I ask: does your multi-card training also require that the first card's VRAM can hold the entire model, or is the model split evenly across the cards as in the official multi-GPU deployment?
My environment is Python 3.8, PyTorch 2.1.2+cu118, NVCC CUDA 12.0, driver 546.17, CUDA 12.3, with two P102-100 cards. I hit the same "Expected all tensors to be on the same device, but found at least two devices..." error, and the method above solved it. Just paste the function in to replace the corresponding function in the original file.