Fine-tuning fails on A40 (46 GB, 4 GPUs)

THUDM / VisualGLM-6B

Chinese and English multimodal conversational language model

Apache License 2.0


Fine-tuning fails on A40 (46 GB, 4 GPUs) #295

Status: Closed (cdqncn closed this 10 months ago)

cdqncn commented 10 months ago

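I ran fine-tuning with the command bash finetune/finetune_visualglm.sh. In the end the program simply quit with "exits with return code=-7", which leaves me no way to trace the error. Any help would be appreciated.

Label: bug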

1049451037 commented 10 months ago

The process was most likely killed because it ran out of memory. Try changing args.device = 'cpu' in the code to args.device = 'cuda'.
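A note on reading the exit status, as a minimal sketch. It assumes the launcher reports the child's status the way Python's subprocess module does, where a negative return code means the process was terminated by the signal with that number. Under that assumption, -7 points at signal 7, which on typical x86-64 Linux is SIGBUS rather than the SIGKILL (9) sent by the kernel's out-of-memory killer. The helper name describe_return_code is made up for illustration.

```python
import signal


def describe_return_code(code: int) -> str:
    """Explain a subprocess-style return code in plain words."""
    if code == 0:
        return "exited normally"
    if code > 0:
        return f"exited with error status {code}"
    # Negative values mean "terminated by signal -code" (POSIX convention
    # as exposed by Python's subprocess module).
    try:
        name = signal.Signals(-code).name
    except ValueError:
        name = "an unknown signal"
    return f"terminated by signal {-code} ({name})"


if __name__ == "__main__":
    # The signal name printed here depends on the platform's numbering.
    print(describe_return_code(-7))
```

A bus error in a training job often comes from exhausted shared memory rather than exhausted RAM or GPU memory, so it is worth checking both.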

cdqncn commented 10 months ago

After changing it to args.device = 'cuda' I still get the same error; it won't run.

[attached image: a252999ba1a0641fef4faccd6add51b]

1049451037 commented 10 months ago

In that case, install the latest sat (SwissArmyTransformer) from GitHub:

git clone https://github.com/THUDM/SwissArmyTransformer
cd SwissArmyTransformer
pip install .

https://github.com/THUDM/VisualGLM-6B/blob/f4429a009ee533b76e8757dce6917fbf0b0408f9/finetune_visualglm.py#L178

Then change the code at that line to:

model, args = FineTuneVisualGLMModel.from_pretrained(model_type, args, overwrite_args={'model_parallel_size': 1})

cdqncn commented 10 months ago

Still not working. Sigh, this is really hard.

1049451037 commented 10 months ago

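You need to update sat. You must update to the GitHub version of sat exactly as I described above; otherwise you will keep getting the error you are seeing.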
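To see which sat build an environment actually has, a small check like the following may help. It is only a sketch: the helper name installed_version is hypothetical, and the distribution name "SwissArmyTransformer" is assumed from the repository name in the clone URL above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumed distribution name; adjust it if your install uses another one.
    print("sat version:", installed_version("SwissArmyTransformer"))
```

Note that a version string alone may not distinguish a released package from a newer GitHub checkout if the version number was not bumped, so reinstalling from the cloned source is the safer route.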

cdqncn commented 10 months ago

I downloaded it directly and then installed it. Is that any different from the GitHub version?

1049451037 commented 10 months ago

Yes, there is a difference. The GitHub code was updated specifically to fix this problem for you.

cdqncn commented 10 months ago

My company's bastion host has no network access. Can I run your commands on my own computer, download the SwissArmyTransformer-0.4.8-py3-none-any.whl file, and then carry it over?

1049451037 commented 10 months ago

You can copy the code from the git repository and carry it over.
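One way to carry a cloned checkout onto a machine without network access is to pack it into a single archive first. The sketch below uses only the Python standard library; the function name pack_checkout and the example paths are hypothetical.

```python
import shutil
from pathlib import Path


def pack_checkout(repo_dir: str, out_base: str) -> str:
    """Pack a directory into <out_base>.tar.gz and return the archive path."""
    repo = Path(repo_dir).resolve()
    if not repo.is_dir():
        raise FileNotFoundError(f"not a directory: {repo}")
    # An absolute output path avoids surprises with the working directory.
    base = str(Path(out_base).resolve())
    # root_dir/base_dir keep the top-level folder name inside the archive.
    return shutil.make_archive(
        base, "gztar", root_dir=str(repo.parent), base_dir=repo.name
    )


# Example (paths are placeholders):
#   pack_checkout("SwissArmyTransformer", "sat_src")  # writes sat_src.tar.gz
```

After unpacking on the offline machine, pip install . can be run inside the extracted folder. Without network access pip cannot download anything, so the package's dependencies need to be installed there already.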

1049451037 commented 10 months ago

I can't see the image you uploaded.

cdqncn commented 10 months ago

Sorry, I just deleted it. It still has a problem. Are you running inside a container, or directly on the host machine?

cdqncn commented 10 months ago

It works now, thanks. The cause was that Docker had not allocated enough memory; adding a shm size of 60G fixed it.
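The fix above presumably refers to Docker's --shm-size option for docker run, which sets the size of /dev/shm inside the container (Docker's default is 64 MB). As a rough sketch, the free space in shared memory can be checked from inside the container before launching training. The helper name shm_free_gib is hypothetical.

```python
import os
import shutil
from typing import Optional


def shm_free_gib(path: str = "/dev/shm") -> Optional[float]:
    """Return free space under `path` in GiB, or None if it does not exist."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    free = shm_free_gib()
    if free is None:
        print("/dev/shm not found on this system")
    else:
        print(f"/dev/shm free: {free:.1f} GiB")
```

If the reported value is close to the 64 MB default, restarting the container with a larger --shm-size (for example 60g, as in this thread) is the likely remedy.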