Coobiw / MPP-LLaVA

Personal Project: MPP-Qwen14B & MPP-Qwen-Next(Multimodal Pipeline Parallel based on Qwen-LM). Support [video/image/multi-image] {sft/conversations}. Don't let the poverty limit your imagination! Train your own 8B/14B LLaVA-training-like MLLM on RTX3090/4090 24GB.
DeepSpeed training fails with "ValueError: optimizer got an empty parameter list" #13

Status: Closed (closed by sunnnnnnnny 5 months ago)

sunnnnnnnny commented 5 months ago

When I run the DDP training command

    python -m torch.distributed.run --nproc_per_node=2 --master_port=12233 train_pipeline.py --cfg-path lavis/projects/pp_qwen14b/train_pp.yaml --num-stages 2

GPU0 reports "Trainable Params: 3937280" and GPU1 reports "Trainable Params: 0". The error seems to occur because one card is enough to hold all the trainable parameters, so the other card gets none. Is this what you intended?

    GPU1 Trainable Params: 0
    Traceback (most recent call last):
      File "train_pipeline.py", line 260, in <module>
        main()
      File "train_pipeline.py", line 181, in main
        engine, optimizer, _, _ = deepspeed.initialize(
      File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/__init__.py", line 192, in initialize
        engine = PipelineEngine(args=args,
      File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 68, in __init__
        super().__init__(*super_args, **super_kwargs)
      File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 307, in __init__
        self._configure_optimizer(optimizer, model_parameters)
      File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1230, in _configure_optimizer
        basic_optimizer = self._configure_basic_optimizer(model_parameters)
      File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1307, in _configure_basic_optimizer
        optimizer = FusedAdam(
      File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 90, in __init__
        super(FusedAdam, self).__init__(params, defaults)
      File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/optim/optimizer.py", line 187, in __init__
        raise ValueError("optimizer got an empty parameter list")
    ValueError: optimizer got an empty parameter list
    GPU0 Trainable Params: 3937280

Coobiw commented 5 months ago

Hello, if it's convenient, could you share more of your logs, such as the list of layers and the pipeline stage each one is assigned to?

sunnnnnnnny commented 5 months ago

    [2024-03-13 08:37:15,006] [INFO] [module.py:375:_partition_layers] Partitioning pipeline stages with method uniform
    stage=0 layers=25
         0: TokenizerPipeLayer
         1: IndentityPipeLayer
         2: IndentityPipeLayer
         3: IndentityPipeLayer
         4: IndentityPipeLayer
         5: QwenBlockPipeLayer
         6: QwenBlockPipeLayer
         7: QwenBlockPipeLayer
         8: QwenBlockPipeLayer
         9: QwenBlockPipeLayer
        10: QwenBlockPipeLayer
        11: QwenBlockPipeLayer
        12: QwenBlockPipeLayer
        13: QwenBlockPipeLayer
        14: QwenBlockPipeLayer
        15: QwenBlockPipeLayer
        16: QwenBlockPipeLayer
        17: QwenBlockPipeLayer
        18: QwenBlockPipeLayer
        19: QwenBlockPipeLayer
        20: QwenBlockPipeLayer
        21: QwenBlockPipeLayer
        22: QwenBlockPipeLayer
        23: QwenBlockPipeLayer
        24: QwenBlockPipeLayer
    stage=1 layers=24
        25: QwenBlockPipeLayer
        26: QwenBlockPipeLayer
        27: QwenBlockPipeLayer
        28: QwenBlockPipeLayer
        29: QwenBlockPipeLayer
        30: QwenBlockPipeLayer
        31: QwenBlockPipeLayer
        32: QwenBlockPipeLayer
        33: QwenBlockPipeLayer
        34: QwenBlockPipeLayer
        35: QwenBlockPipeLayer
        36: QwenBlockPipeLayer
        37: QwenBlockPipeLayer
        38: QwenBlockPipeLayer
        39: QwenBlockPipeLayer
        40: QwenBlockPipeLayer
        41: QwenBlockPipeLayer
        42: QwenBlockPipeLayer
        43: QwenBlockPipeLayer
        44: QwenBlockPipeLayer
        45: FLNPipeLayer
        46: LMPipeLayer
        47: LossPipeLayer
        48: IndentityPipeLayerLast
    GPU1 Trainable Params: 0
    Traceback (most recent call last):
      File "train_pipeline.py", line 260, in <module>
        main()
      File "train_pipeline.py", line 181, in main
        engine, optimizer, _, _ = deepspeed.initialize(
      File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/__init__.py", line 191, in initialize
        engine = PipelineEngine(args=args,
      File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 68, in __init__
        super().__init__(*super_args, **super_kwargs)
      File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 307, in __init__
        self._configure_optimizer(optimizer, model_parameters)
      File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1230, in _configure_optimizer
        basic_optimizer = self._configure_basic_optimizer(model_parameters)
      File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1307, in _configure_basic_optimizer
        optimizer = FusedAdam(
      File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 90, in __init__
        super(FusedAdam, self).__init__(params, defaults)
      File "/home/duser/miniconda3/envs/gpt/lib/python3.8/site-packages/torch/optim/optimizer.py", line 187, in __init__
        raise ValueError("optimizer got an empty parameter list")
    ValueError: optimizer got an empty parameter list
    GPU0 Trainable Params: 3937280
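For readers following along, here is a rough sketch of why the partition log above ends with zero trainable parameters on GPU1. It is plain Python, not DeepSpeed's own code: `uniform_bounds` is a hypothetical helper that only reproduces the 25/24 split shown in the log. Placing all 3937280 trainable parameters in layer 0 is also an assumption. The log confirms only that they all sit somewhere in stage 0.

```python
from typing import Dict, List


def uniform_bounds(num_layers: int, num_stages: int) -> List[int]:
    """Split layers as evenly as possible; earlier stages take the remainder."""
    base, extra = divmod(num_layers, num_stages)
    bounds = [0]
    for stage in range(num_stages):
        bounds.append(bounds[-1] + base + (1 if stage < extra else 0))
    return bounds


# Layer names copied from the partition log (49 layers in total).
layers = (
    ["TokenizerPipeLayer"]
    + ["IndentityPipeLayer"] * 4
    + ["QwenBlockPipeLayer"] * 40
    + ["FLNPipeLayer", "LMPipeLayer", "LossPipeLayer", "IndentityPipeLayerLast"]
)

# ASSUMPTION: every trainable parameter lives in layer 0; all other layers are frozen.
trainable_per_layer: Dict[int, int] = {0: 3937280}

bounds = uniform_bounds(len(layers), num_stages=2)
per_stage = [
    sum(trainable_per_layer.get(i, 0) for i in range(bounds[s], bounds[s + 1]))
    for s in range(len(bounds) - 1)
]
# Stages that would hand an empty parameter list to their optimizer.
empty_stages = [s for s, count in enumerate(per_stage) if count == 0]

print(bounds)        # stage boundaries as layer indices
print(per_stage)     # trainable parameter count per stage
print(empty_stages)  # stages with nothing to optimize
```

Each pipeline rank builds its optimizer from its own stage's parameters only, so any stage listed in `empty_stages` would hit the `ValueError` from the traceback.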

Coobiw commented 5 months ago

I deleted a placeholder parameter by mistake. Sorry about that. The fix has been pushed in commit 4d370b275810e89bdb28d8210e6e173f3d15ec68. Thanks for the helpful issue!
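To illustrate the general idea (this is not the code from the commit above, which isn't shown in this thread), here is a minimal PyTorch sketch of a placeholder parameter. `FrozenStage`, `placeholder`, and `build_optimizer` are hypothetical names, and plain `torch.optim.Adam` stands in for DeepSpeed's `FusedAdam`. It assumes a stage whose real weights are all frozen, like stage 1 in the log.

```python
import torch
from torch import nn


class FrozenStage(nn.Module):
    """Stand-in for a pipeline stage whose real weights are all frozen."""

    def __init__(self, add_placeholder: bool):
        super().__init__()
        self.block = nn.Linear(8, 8)
        for p in self.block.parameters():
            p.requires_grad = False  # frozen, like every layer on stage 1
        if add_placeholder:
            # Hypothetical placeholder: one trainable scalar that forward()
            # never uses, so it gets no gradient and is never updated.
            self.placeholder = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return self.block(x)


def build_optimizer(module: nn.Module) -> torch.optim.Optimizer:
    # Pass only the trainable parameters, as a per-stage optimizer would.
    params = [p for p in module.parameters() if p.requires_grad]
    return torch.optim.Adam(params, lr=1e-4)


# Without the placeholder the parameter list is empty and construction fails.
try:
    build_optimizer(FrozenStage(add_placeholder=False))
    error_message = None
except ValueError as err:
    error_message = str(err)

# With the placeholder the optimizer has exactly one scalar to hold on to.
optimizer = build_optimizer(FrozenStage(add_placeholder=True))
num_params = sum(p.numel() for g in optimizer.param_groups for p in g["params"])

print(error_message)
print(num_params)
```

The trade-off is a single unused scalar per stage in exchange for every rank being able to construct its optimizer.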

sunnnnnnnny commented 5 months ago

OK, thanks. It works for me now.