llama-33B/llama-65B both report OOM and will not run on 8 × V100. What is going wrong? - Githubissues

OpenLMLab / LOMO

LOMO: LOw-Memory Optimization

MIT License

llama-33B/llama-65B both report OOM and will not run on 8 × V100. What is going wrong? #28

Open alisyzhu opened 1 year ago

alisyzhu commented 1 year ago

Environment: 8 × V100 (32 GB). Ran run.sh. [Error log]

[LOMO mode] args_lomo.yaml configuration:

ds_config.json configuration:

[LOMO + LoRA mode] args_lomo_lora.yaml configuration:

ds_config_lora.json configuration:

KaiLv69 commented 1 year ago

Hi, could you please provide run.sh and a more complete error log?

alisyzhu commented 1 year ago

run.sh script:

[Error log]

KaiLv69 commented 1 year ago

Right now only one GPU is being used. You should set --include localhost:0,1,2,3,4,5,6,7 to use all of the GPUs.
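A minimal sketch of the suggestion above, assuming run.sh launches training through the `deepspeed` CLI. The training script path is a placeholder and does not come from this thread; only `--include` and `args_lomo.yaml` are taken from the discussion. The command is printed rather than executed, so it can be checked before use.

```shell
#!/bin/sh
# Number of GPUs on this machine (8 x V100 in the original report).
NUM_GPUS=8

# Build the comma-separated GPU index list: 0,1,...,NUM_GPUS-1.
i=0
GPU_IDS=""
while [ "$i" -lt "$NUM_GPUS" ]; do
  GPU_IDS="${GPU_IDS:+$GPU_IDS,}$i"
  i=$((i + 1))
done

# The --include value has the form host:gpu_ids.
INCLUDE="localhost:${GPU_IDS}"

# Placeholder script path (hypothetical); replace it with the entry point
# that run.sh actually launches.
TRAIN_SCRIPT="path/to/train_script.py"

# Print the launch command instead of running it.
echo deepspeed --include "$INCLUDE" "$TRAIN_SCRIPT" args_lomo.yaml
```

Listing a single index in `--include` restricts the launcher to that one GPU, so the full model has to fit in one card's memory. Listing all eight indices lets DeepSpeed start one process per GPU.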

alisyzhu commented 1 year ago

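My mistake, I only looked at the error part of the log. If I want to train on multiple nodes with multiple GPUs each, how should I configure the localhost part?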

KaiLv69 commented 1 year ago

You can refer to https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node
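As a rough sketch of what the linked page describes: multi-node runs are configured with a hostfile that lists each node and its GPU count as `hostname slots=N`, and the nodes must be reachable over passwordless SSH. The hostnames `node-1` and `node-2`, the hostfile location, and the training script path below are hypothetical placeholders, not values from this thread. The command is printed rather than executed.

```shell
#!/bin/sh
# Write a hostfile with one line per node: "<hostname> slots=<num_gpus>".
# node-1 and node-2 are hypothetical hostnames.
HOSTFILE="${TMPDIR:-/tmp}/lomo_hostfile"
cat > "$HOSTFILE" <<'EOF'
node-1 slots=8
node-2 slots=8
EOF

# Sanity check: sum the slots to see how many GPUs the job will span.
TOTAL_SLOTS=$(awk -F'slots=' '{ total += $2 } END { print total }' "$HOSTFILE")
echo "total GPUs across nodes: $TOTAL_SLOTS"

# Placeholder script path (hypothetical); replace it with the entry point
# that run.sh actually launches.
TRAIN_SCRIPT="path/to/train_script.py"

# Print the launch command instead of running it.
echo deepspeed --hostfile "$HOSTFILE" "$TRAIN_SCRIPT" args_lomo.yaml
```

With a hostfile in place, `--include` can also take node names instead of `localhost`, for example `--include="node-1:0,1"`, to select specific GPUs on a specific node.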

alisyzhu commented 1 year ago

OK, thanks.

00drdelius commented 1 year ago

Training a 13B model on three 3090s reports OOM 👇 [screenshot]

[screenshot]

The configuration is as follows. args_lomo.yaml: [screenshot]

ds_config.json: [screenshot]

run.sh: [screenshot]

The model I am running is baichuan-13b. The only change I made to the source code was to save the model to a separate output directory whenever the loss drops below 0.46: [screenshot]

How can I fix this?