THUDM / VisualGLM-6B

Chinese and English multimodal conversational language model | 多模态中英双语对话语言模型
Apache License 2.0
4.07k stars 414 forks

Running finetune_224_lora.sh fails with "please pass LOCAL_WORLD_SIZE environment variable." #298

Open xuxuxuchen opened 10 months ago

xuxuxuchen commented 10 months ago

I have already added the LOCAL_WORLD_SIZE environment variable to .deepspeed_env, but the model still raises the error. How can I fix this?

1049451037 commented 10 months ago

What error do you get?

xuxuxuchen commented 10 months ago

The .deepspeed_env file is in the CogVLM-main folder and contains: SAT_HOME=~/.sat_models LOCAL_WORLD_SIZE=8

The error message is:

  File "/work/home/CogVLM-main/finetune_demo.py", line 239, in <module>
    model, args = FineTuneTrainCogVLMModel.from_pretrained(args.from_pretrained, args, overwrite_args={'model_parallel_size': args.model_parallel_size} if args.model_parallel_size != 1 else {})
  File "/opt/conda/lib/python3.10/site-packages/sat/model/base_model.py", line 220, in from_pretrained
    local_rank = get_node_rank()
  File "/opt/conda/lib/python3.10/site-packages/sat/mpu/initialize.py", line 144, in get_node_rank
    return torch.distributed.get_rank(group=get_node_group())
  File "/opt/conda/lib/python3.10/site-packages/sat/mpu/initialize.py", line 122, in get_node_group
    assert _NODE_GROUP is not None, \
AssertionError: node group is not initialized, please pass LOCAL_WORLD_SIZE environment variable.

xuxuxuchen commented 10 months ago

I am using a single machine with 4 A100 GPUs, 40 GB of memory each.

1049451037 commented 10 months ago

A single-machine run does not use the .deepspeed_env file. That file only takes effect for multi-machine runs.

1049451037 commented 10 months ago

You can try installing the latest sat from GitHub:

git clone https://github.com/THUDM/SwissArmyTransformer
cd SwissArmyTransformer
pip install .

The new version has already fixed this problem.

xuxuxuchen commented 10 months ago

After reinstalling, I get this error: TypeError: BaseFileLock.__init__() got an unexpected keyword argument 'mode'

    model, args = FineTuneTrainCogVLMModel.from_pretrained(args.from_pretrained, args, overwrite_args={'model_parallel_size': args.model_parallel_size} if args.model_parallel_size != 1 else {})
  File "/opt/conda/lib/python3.10/site-packages/sat/model/base_model.py", line 219, in from_pretrained
    model, model_args = cls.from_pretrained_base(name, args=args, home_path=home_path, url=url, prefix=prefix, build_only=True, overwrite_args=overwrite_args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/sat/model/base_model.py", line 201, in from_pretrained_base
    model_path = auto_create(name, path=home_path, url=url)
  File "/opt/conda/lib/python3.10/site-packages/sat/resources/download.py", line 50, in auto_create
    lock = FileLock(model_path + '.lock', mode=0o777)
TypeError: BaseFileLock.__init__() got an unexpected keyword argument 'mode'

If I delete mode=0o777, the model runs, but it goes back to reporting the LOCAL_WORLD_SIZE error.

1049451037 commented 10 months ago

pip install -U filelock

Please don't casually delete sat code~

1049451037 commented 10 months ago

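Also, the fact that you hit this filelock error means your previous sat version was very old. I recommend re-downloading the model, because some model weight formats have changed over time.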

xuxuxuchen commented 10 months ago

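1. OK, I won't try to be clever and modify the code next time.
2. After pip install -U filelock, the run still reports the LOCAL_WORLD_SIZE error.
3. How do I re-download the model? The server cannot use git clone, so I downloaded the sat zip from GitHub, unpacked it, and ran pip install . on the command line. Before that I also ran pip uninstall on the old version. pip list now shows SwissArmyTransformer 0.4.8.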

1049451037 commented 10 months ago

That means the right version of sat was not installed. The new sat no longer needs the LOCAL_WORLD_SIZE parameter. Please follow these steps exactly:

git clone https://github.com/THUDM/SwissArmyTransformer
cd SwissArmyTransformer
pip install .

1049451037 commented 10 months ago

https://github.com/THUDM/SwissArmyTransformer/blob/2f1a73f5dc5789b103972370b7b12d70b15c8d08/sat/mpu/initialize.py#L84-L89

xuxuxuchen commented 10 months ago

1. OK, I will look for an alternative to git clone.
2. I see that lines 84-91 at the link you shared still read this environment variable with os.environ.get('LOCAL_WORLD_SIZE', None). Also, my error comes from the assert at line 126, 'node group is not initialized, please pass LOCAL_WORLD_SIZE environment', which is not actually directly related to LOCAL_WORLD_SIZE.

    guess_local_world_size = world_size if world_size < 8 else 8
    local_world_size = os.environ.get('LOCAL_WORLD_SIZE', None)
    if local_world_size is None:
        local_world_size = guess_local_world_size
        print_rank0(f"You didn't pass in LOCAL_WORLD_SIZE environment variable. We use the guessed LOCAL_WORLD_SIZE={guess_local_world_size}. If this is wrong, please pass the LOCAL_WORLD_SIZE manually.")
    local_world_size = int(local_world_size)
    # Build the node groups.
    global _NODE_GROUP

line 126:

    assert _NODE_GROUP is not None, \
        'node group is not initialized, please pass LOCAL_WORLD_SIZE environment variable.'
    return _NODE_GROUP

1049451037 commented 10 months ago

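The error at line 126 happens precisely because the lines of code at the link I gave did not run. That means the code installed on your machine is different from the code at that link.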
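One way to check which copy of sat Python actually picks up is to ask the interpreter directly. The following is a minimal sketch, not part of sat; `describe_install` is a hypothetical helper name, and it only assumes the distribution is named `SwissArmyTransformer` (as shown by pip list above) and the importable module is `sat` (as shown in the traceback paths).

```python
import importlib.metadata
import importlib.util


def describe_install(dist_name, module_name):
    """Return the installed version and file location of a package.

    Both values are None when the distribution or module cannot be found.
    Hypothetical helper for illustration; it is not provided by sat.
    """
    try:
        version = importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        version = None
    try:
        # find_spec locates a top-level module without importing it.
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        spec = None
    return {"version": version, "location": spec.origin if spec else None}


if __name__ == "__main__":
    # Distribution name taken from `pip list`, module name from the traceback.
    print(describe_install("SwissArmyTransformer", "sat"))
```

If the reported location is not the directory that was just installed, or the `mpu/initialize.py` next to it does not match the linked commit, an older copy is probably still on the path.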

xuxuxuchen commented 10 months ago

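Hello, sorry to bother you again. The server really cannot use git clone, so I ran git clone locally, copied the result to the server, and then ran pip install . there. I still get exactly the same error: AssertionError: node group is not initialized, please pass LOCAL_WORLD_SIZE environment variable. Could you help me think of any other possible cause?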

1049451037 commented 10 months ago

https://github.com/THUDM/CogVLM/blob/main/scripts/finetune_224_lora.sh#L5

Change the 8 here to 4, because you only have 4 GPUs.
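A plausible reading of why this fixes the assertion, sketched under two assumptions that the thread does not confirm: that the value on that script line ends up as LOCAL_WORLD_SIZE, and that the node groups are built by splitting the ranks into full chunks of LOCAL_WORLD_SIZE. The functions below are a hypothetical model, not sat's actual code.

```python
def build_node_groups(world_size, local_world_size):
    """Split ranks 0..world_size-1 into full groups of local_world_size ranks.

    Assumed model of node-group construction, for illustration only.
    """
    groups = []
    for i in range(world_size // local_world_size):
        start = i * local_world_size
        groups.append(list(range(start, start + local_world_size)))
    return groups


def node_group_of(rank, world_size, local_world_size):
    """Return the group containing rank, or None if no group was built for it."""
    for group in build_node_groups(world_size, local_world_size):
        if rank in group:
            return group
    return None


if __name__ == "__main__":
    # 4 processes but a per-node size of 8: 4 // 8 == 0, so no group is built.
    print(node_group_of(0, world_size=4, local_world_size=8))  # None
    # Matching the per-node size to the 4 available GPUs yields one group.
    print(node_group_of(0, world_size=4, local_world_size=4))  # [0, 1, 2, 3]
```

Under this model, a per-node size larger than the number of launched processes leaves every rank without a node group, which would match the "node group is not initialized" assertion seen on a 4-GPU machine.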

xuxuxuchen commented 10 months ago

I changed it and it runs now. Thank you very much!