Closed: sunnnnnnnny closed this issue 8 months ago
Hello, if convenient, could you share more of your logs, such as the layers and their corresponding partitions?
---- Replied Message ----
From: sunnnnnnnny
Date: 03/13/2024 15:25
To: Coobiw/MiniGPT4Qwen
Subject: [Coobiw/MiniGPT4Qwen] deepspeed training, meet the error "ValueError: optimizer got an empty parameter list" (Issue #13)
When I run the DDP training command "python -m torch.distributed.run --nproc_per_node=2 --master_port=12233 train_pipeline.py --cfg-path lavis/projects/pp_qwen14b/train_pp.yaml --num-stages 2", GPU0 reports Trainable Params: 3937280 while GPU1 reports Trainable Params: 0.
This problem seems to occur because one card is enough to hold all the trainable parameters. Is this what you mean?
[2024-03-13 08:37:15,006] [INFO] [module.py:375:_partition_layers] Partitioning pipeline stages with method uniform
stage=0 layers=25
0: TokenizerPipeLayer
1: IndentityPipeLayer
2: IndentityPipeLayer
3: IndentityPipeLayer
4: IndentityPipeLayer
5: QwenBlockPipeLayer
6: QwenBlockPipeLayer
7: QwenBlockPipeLayer
8: QwenBlockPipeLayer
9: QwenBlockPipeLayer
10: QwenBlockPipeLayer
11: QwenBlockPipeLayer
12: QwenBlockPipeLayer
13: QwenBlockPipeLayer
14: QwenBlockPipeLayer
15: QwenBlockPipeLayer
16: QwenBlockPipeLayer
17: QwenBlockPipeLayer
18: QwenBlockPipeLayer
19: QwenBlockPipeLayer
20: QwenBlockPipeLayer
21: QwenBlockPipeLayer
22: QwenBlockPipeLayer
23: QwenBlockPipeLayer
24: QwenBlockPipeLayer
stage=1 layers=24
25: QwenBlockPipeLayer
26: QwenBlockPipeLayer
27: QwenBlockPipeLayer
28: QwenBlockPipeLayer
29: QwenBlockPipeLayer
30: QwenBlockPipeLayer
31: QwenBlockPipeLayer
32: QwenBlockPipeLayer
33: QwenBlockPipeLayer
34: QwenBlockPipeLayer
35: QwenBlockPipeLayer
36: QwenBlockPipeLayer
37: QwenBlockPipeLayer
38: QwenBlockPipeLayer
39: QwenBlockPipeLayer
40: QwenBlockPipeLayer
41: QwenBlockPipeLayer
42: QwenBlockPipeLayer
43: QwenBlockPipeLayer
44: QwenBlockPipeLayer
45: FLNPipeLayer
46: LMPipeLayer
47: LossPipeLayer
48: IndentityPipeLayerLast
GPU1 Trainable Params: 0
Traceback (most recent call last):
File "train_pipeline.py", line 260, in <module>
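The partition log above explains the crash: with the uniform method, the 49 layers split 25/24, and the only trainable layer ends up on stage 0, so stage 1's optimizer receives nothing. A minimal sketch in plain Python (layer names taken from the log; which layers are trainable is an assumption, with only TokenizerPipeLayer assumed to hold the trainable projection):

```python
# Hypothetical sketch: (layer name, trainable?) pairs matching the 49-layer
# log above. Only the first layer is assumed trainable; the rest are frozen.
layers = (
    [("TokenizerPipeLayer", True)]               # assumed to hold the trainable projection
    + [("IndentityPipeLayer", False)] * 4        # frozen identity layers
    + [("QwenBlockPipeLayer", False)] * 40       # frozen LLM blocks
    + [("FLNPipeLayer", False), ("LMPipeLayer", False),
       ("LossPipeLayer", False), ("IndentityPipeLayerLast", False)]
)

def stage_trainable(layers, num_stages, stage):
    """Uniform partition, then collect the trainable layers of one stage."""
    per_stage = -(-len(layers) // num_stages)    # ceiling division: 49 -> 25
    chunk = layers[stage * per_stage:(stage + 1) * per_stage]
    return [name for name, trainable in chunk if trainable]

print(stage_trainable(layers, 2, 0))  # ['TokenizerPipeLayer']
print(stage_trainable(layers, 2, 1))  # [] -> the stage-1 optimizer gets an empty list
```

With these flags, stage 0 (layers 0-24) keeps the single trainable layer and stage 1 (layers 25-48) holds only frozen modules, which is exactly the "GPU1 Trainable Params: 0" situation that makes FusedAdam raise.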
I deleted the code that occupies parameters by mistake. Sorry about that. The new version has been pushed in 4d370b275810e89bdb28d8210e6e173f3d15ec68. Thanks for your helpful issue!
OK, thanks. It works for me.
GPU1 Trainable Params: 0
Traceback (most recent call last):
File "train_pipeline.py", line 260, in <module>
main()
File "train_pipeline.py", line 181, in main
engine, optimizer, _, _ = deepspeed.initialize(
File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/__init__.py", line 192, in initialize
engine = PipelineEngine(args=args,
File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 68, in __init__
super().__init__(*super_args, **super_kwargs)
File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 307, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1230, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1307, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 90, in __init__
super(FusedAdam, self).__init__(params, defaults)
File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/optim/optimizer.py", line 187, in __init__
raise ValueError("optimizer got an empty parameter list")
ValueError: optimizer got an empty parameter list
GPU0 Trainable Params: 3937280
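A takeaway for similar pipeline-parallel setups: counting trainable parameters per rank before calling deepspeed.initialize turns this late FusedAdam crash into an immediate, readable error. A sketch (hypothetical helper names, not from this repo; duck-typed so anything exposing numel() and requires_grad, like torch parameters, works):

```python
def count_trainable(params):
    """Total element count of parameters that require grad on this stage,
    i.e. the number behind the 'GPU{rank} Trainable Params' log lines."""
    return sum(p.numel() for p in params if p.requires_grad)

def assert_stage_nonempty(params, rank):
    """Fail fast, before deepspeed.initialize, with a clearer message than
    FusedAdam's 'optimizer got an empty parameter list'."""
    n = count_trainable(params)
    print(f"GPU{rank} Trainable Params: {n}")
    if n == 0:
        raise ValueError(
            f"pipeline stage on GPU{rank} has no trainable parameters; "
            "deepspeed.initialize would crash with an empty parameter list")
```

Calling assert_stage_nonempty(model.parameters(), rank) on each rank right before deepspeed.initialize would have pointed directly at stage 1 instead of producing the deep traceback above.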