baichuan-inc / Baichuan-7B

A large-scale 7B pretraining language model developed by BaiChuan-Inc.
https://huggingface.co/baichuan-inc/baichuan-7B
Apache License 2.0
5.67k stars, 504 forks

[Question] Single-machine, single-GPU training fails: gradients cannot be initialized. #109

Open xkjcf opened 1 year ago

xkjcf commented 1 year ago

Questions

I downloaded the model, created the data_dir directory, and created a new script, script/train2.sh:

```
#!/bin/bash

deepspeed train.py \
    --deepspeed \
    --deepspeed_config config/deepspeed.json
```

Running the script produces the following error:

```
Traceback (most recent call last):
  File "/root/code/Baichuan-7B/train.py", line 138, in <module>
    model_engine = prepare_model()
  File "/root/code/Baichuan-7B/train.py", line 117, in prepare_model
    model_engine, _, _, _ = deepspeed.initialize(args=args,
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1173, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1409, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 468, in __init__
    self.initialize_gradient_partitioning_data_structures()
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 691, in initialize_gradient_partitioning_data_structures
    self.first_param_index_in_partition[i][partition_id] = self.get_first_param_index(
  File "/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 666, in get_first_param_index
    if partition_id in self.param_to_partition_ids[group_id][param_id]:
KeyError: 0
```

The training files in data_dir are ordinary multi-line plain text.

DoliteMatheo commented 1 year ago

Same question. It also fails with Python 3.9, and switching to a different machine did not help either.

LiManshiang commented 1 year ago

Same question; I ran into the same problem. Another issue is that the versions in the requirements conflict:

The conflict is caused by:
    The user requested torch==2.0.0
    deepspeed 0.9.2 depends on torch
    xformers 0.0.20 depends on torch==2.0.1
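When debugging a pin conflict like this, it can help to confirm which versions are actually installed in the environment. The following is only a minimal sketch using the standard library; `installed_versions` is a hypothetical helper name, and the package list is taken from the pip message above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not present in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names come from the pip conflict message in this thread.
    for name, version in installed_versions(["torch", "deepspeed", "xformers"]).items():
        print(f"{name}: {version or 'not installed'}")
```

If torch reports 2.0.0 while xformers 0.0.20 is installed, the environment does not satisfy the xformers pin shown in the message.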

Aurora-slz commented 1 year ago

+1

kztao commented 1 year ago

Same question here.

hingkan commented 1 year ago

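> Same question; I ran into the same problem. Another issue is that the versions in the requirements conflict:
>
> The conflict is caused by:
>     The user requested torch==2.0.0
>     deepspeed 0.9.2 depends on torch
>     xformers 0.0.20 depends on torch==2.0.1

I saw this mentioned in other issues as well. I also have torch==2.0.1 installed, but the problem above still occurs. How did everyone solve it?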

xinruozhang575 commented 1 year ago

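I ran into the same problem and found a related explanation in the DeepSpeed issues: https://github.com/microsoft/DeepSpeed/issues/3234. ZeRO stage 3 supports zero.Init, while stages 1 and 2 do not. Changing the stage to 3 in deepspeed.json solved the problem for me.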
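For anyone who wants to apply that change with a script rather than by hand, here is a minimal sketch. It assumes config/deepspeed.json keeps the stage under the standard DeepSpeed `zero_optimization.stage` key; `set_zero_stage` is a hypothetical helper, not part of Baichuan-7B or DeepSpeed.

```python
import json
from pathlib import Path


def set_zero_stage(config_path, stage=3):
    """Set zero_optimization.stage in a DeepSpeed JSON config.

    Returns the previous stage value, or None if it was not set.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))

    # Create the section if the config does not have one yet.
    zero_section = config.setdefault("zero_optimization", {})
    previous_stage = zero_section.get("stage")
    zero_section["stage"] = stage

    # Write the config back; all other keys are left unchanged.
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous_stage
```

Calling `set_zero_stage("config/deepspeed.json")` from the repository root would rewrite the file in place, so keeping a backup of the original config is a sensible precaution. Stage 3 also partitions the model parameters, so memory use and communication behavior may differ from stages 1 and 2.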

Silentssss commented 10 months ago

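> I ran into the same problem and found a related explanation in the DeepSpeed issues: https://github.com/microsoft/DeepSpeed/issues/3234. ZeRO stage 3 supports zero.Init, while stages 1 and 2 do not. Changing the stage to 3 in deepspeed.json solved the problem for me.

[image] After making the change you described, I get a new error. Did you run into it as well?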

ucaslei commented 3 months ago

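> > I ran into the same problem and found a related explanation in the DeepSpeed issues: https://github.com/microsoft/DeepSpeed/issues/3234. ZeRO stage 3 supports zero.Init, while stages 1 and 2 do not. Changing the stage to 3 in deepspeed.json solved the problem for me.
>
> [image] After making the change you described, I get a new error. Did you run into it as well?

I ran into this problem too. Is there a fix?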