Closed HuaizeLiu closed 2 months ago
Thanks for your attention! First, check your resume code and make sure there are no missing or unexpected keys in the model weights. Then, check whether the optimizer is also loaded by Accelerate. If both of those are fine, please upload more logs.
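A minimal way to surface missing/unexpected keys is to diff the checkpoint's key set against the model's. This is a generic sketch using plain key lists to stand in for state dicts (the key names are illustrative; with PyTorch you would pass `model.state_dict().keys()` and the loaded checkpoint's keys):

```python
def diff_state_dict_keys(model_keys, ckpt_keys):
    """Return (missing, unexpected) keys when loading a checkpoint into a model."""
    model_keys, ckpt_keys = set(model_keys), set(ckpt_keys)
    missing = sorted(model_keys - ckpt_keys)      # in the model, absent from the checkpoint
    unexpected = sorted(ckpt_keys - model_keys)   # in the checkpoint, absent from the model
    return missing, unexpected

# Toy key sets standing in for real state dicts.
model = ["encoder.weight", "encoder.bias", "head.weight"]
ckpt = ["encoder.weight", "encoder.bias", "old_head.weight"]
missing, unexpected = diff_state_dict_keys(model, ckpt)
print(missing)      # → ['head.weight']      (checkpoint failed to provide these)
print(unexpected)   # → ['old_head.weight']  (stale keys left in the checkpoint)
```

If both lists are empty, the weights at least line up by name; a resume failure then usually points at the optimizer or scheduler state instead.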
I checked the full code and the .yaml file; they are fine. My checkpoint file and more logs are below:
```
[2024-07-10 14:43:15,967] [INFO] [config.py:987:print_user_config] json = {
"train_batch_size": 16,
"train_micro_batch_size_per_gpu": 8,
"gradient_accumulation_steps": 1,
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "none",
"nvme_path": null
},
"offload_param": {
"device": "none",
"nvme_path": null
},
"stage3_gather_16bit_weights_on_model_save": false
},
"steps_per_print": inf,
"fp16": {
"enabled": false
},
"bf16": {
"enabled": false
},
"zero_allow_untested_optimizer": true
}
INFO:main:save config to ./hdtf_output/stage1
INFO:main: Running training
INFO:main: Num examples = 197
INFO:main: Num Epochs = 308
INFO:main: Instantaneous batch size per device = 8
INFO:main: Total train batch size (w. parallel, distributed & accumulation) = 16
INFO:main: Gradient Accumulation steps = 1
INFO:main: Total optimization steps = 4000
INFO:main:Loading checkpoint from ./hdtf_output/stage1/checkpoints
INFO:accelerate.accelerator:Loading states from ./hdtf_output/stage1/checkpoints/checkpoint-2000
INFO:accelerate.accelerator:Loading DeepSpeed Model and Optimizer
[2024-07-10 14:43:16,065] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from ./hdtf_output/stage1/checkpoints/checkpoint-2000/pytorch_model/mp_rank_00_model_states.pt...
[2024-07-10 14:43:16,081] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from ./hdtf_output/stage1/checkpoints/checkpoint-2000/pytorch_model/zero_pp_rank_1_mp_rank_00_optim_states.pt...
[2024-07-10 14:43:17,867] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./hdtf_output/stage1/checkpoints/checkpoint-2000/pytorch_model/mp_rank_00_model_states.pt.
[2024-07-10 14:43:18,056] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from ./hdtf_output/stage1/checkpoints/checkpoint-2000/pytorch_model/mp_rank_00_model_states.pt...
[2024-07-10 14:43:19,899] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./hdtf_output/stage1/checkpoints/checkpoint-2000/pytorch_model/mp_rank_00_model_states.pt.
[2024-07-10 14:43:21,396] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./hdtf_output/stage1/checkpoints/checkpoint-2000/pytorch_model/zero_pp_rank_1_mp_rank_00_optim_states.pt.
[2024-07-10 14:43:21,396] [INFO] [engine.py:3018:_get_all_zero_checkpoint_state_dicts] successfully read 2 ZeRO state_dicts for rank 1
[2024-07-10 14:43:22,197] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from ./hdtf_output/stage1/checkpoints/checkpoint-2000/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-07-10 14:43:22,742] [INFO] [engine.py:2968:_load_zero_checkpoint] loading 2 zero partition checkpoints for rank 1
[2024-07-10 14:43:27,488] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./hdtf_output/stage1/checkpoints/checkpoint-2000/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-07-10 14:43:27,488] [INFO] [engine.py:3018:_get_all_zero_checkpoint_state_dicts] successfully read 2 ZeRO state_dicts for rank 0
[2024-07-10 14:43:29,386] [INFO] [engine.py:2968:_load_zero_checkpoint] loading 2 zero partition checkpoints for rank 0
INFO:accelerate.accelerator:DeepSpeed Model and Optimizer loaded from input dir ./hdtf_output/stage1/checkpoints/checkpoint-2000/pytorch_model
INFO:accelerate.checkpointing:All model weights loaded successfully
INFO:accelerate.checkpointing:All optimizer states loaded successfully
INFO:accelerate.checkpointing:All scheduler states loaded successfully
INFO:accelerate.checkpointing:All dataloader sampler states loaded successfully
INFO:accelerate.checkpointing:Could not load random states
INFO:accelerate.accelerator:Loading in 0 custom states
Resuming from checkpoint checkpoint-2000
Steps: 0%| | 0/2000 [00:00<?, ?it/s][2024-07-10 14:43:38,946] [INFO]
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, reducing to 16384
Steps: 0%| | 1/2000 [00:09<5:01:50, 9.06s/it, lr=0.0001, step_loss=nan][2024-07-10 14:43:40,762] [INFO]
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384, reducing to 8192
Steps: 0%|▏ | 2/2000 [00:10<2:39:47, 4.80s/it, lr=0.0001, step_loss=nan][2024-07-10 14:43:42,572] [INFO]
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192, reducing to 4096
Steps: 0%|▏ | 3/2000 [00:12<1:54:17, 3.43s/it, lr=0.0001, step_loss=nan][2024-07-10 14:43:44,386] [INFO]
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4096, reducing to 2048
Steps: 0%|▎ | 4/2000 [00:14<1:32:57, 2.79s/it, lr=0.0001, step_loss=nan][2024-07-10 14:43:45,560] [INFO]
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2048, reducing to 1024
Steps: 0%|▎ | 5/2000 [00:15<1:13:30, 2.21s/it, lr=0.0001, step_loss=nan][2024-07-10 14:43:47,370] [INFO]
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 512
Steps: 0%|▍ | 6/2000 [00:17<1:08:55, 2.07s/it, lr=0.0001, step_loss=nan][2024-07-10 14:43:49,184] [INFO]
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 512, reducing to 256
Steps: 0%|▍ | 7/2000 [00:19<1:06:04, 1.99s/it, lr=0.0001, step_loss=nan][2024-07-10 14:43:50,995] [INFO]
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 256, reducing to 128
Steps: 0%|▌ | 8/2000 [00:21<1:04:09, 1.93s/it, lr=0.0001, step_loss=nan][2024-07-10 14:43:52,804] [INFO]
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 128, reducing to 64
Steps: 0%|▌ | 9/2000 [00:22<1:02:51, 1.89s/it, lr=0.0001, step_loss=nan][2024-07-10 14:43:54,620] [INFO]
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 64, reducing to 32
Steps: 0%|▋ | 10/2000 [00:24<1:02:00, 1.87s/it, lr=0.0001, step_loss=nan][2024-07-10 14:43:56,431] [INFO]
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32, reducing to 16
Steps: 1%|▋ | 11/2000 [00:26<1:01:23, 1.85s/it, lr=0.0001, step_loss=nan][2024-07-10 14:43:58,244] [INFO]
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16, reducing to 8
Steps: 1%|▊ | 12/2000 [00:28<1:00:57, 1.84s/it, lr=0.0001, step_loss=nan][2024-07-10 14:44:00,133] [INFO]
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8, reducing to 4
Steps: 1%|▊ | 13/2000 [00:30<1:01:25, 1.85s/it, lr=0.0001, step_loss=nan][2024-07-10 14:44:02,838] [INFO]
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4, reducing to 2
Steps: 1%|▉ | 14/2000 [00:32<1:09:54, 2.11s/it, lr=0.0001, step_loss=nan][2024-07-10 14:44:04,653] [INFO]
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2, reducing to 1
Steps: 1%|▉ | 15/2000 [00:34<1:06:54, 2.02s/it, lr=0.0001, step_loss=nan]ERROR:root:Failed to execute the
training process: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
ERROR:root:Failed to execute the training process: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
Steps: 1%|▉
```
Do you use the same num_processes config as when the checkpoint was saved?
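For reference, the log above shows two ZeRO shards (`zero_pp_rank_0…` and `zero_pp_rank_1…`, "successfully read 2 ZeRO state_dicts"), so the checkpoint was saved with a world size of 2. A hedged sketch of the relevant Accelerate config fields (file name and surrounding values are illustrative):

```yaml
# accelerate config (illustrative) — num_processes on resume must match
# the world size the checkpoint was saved with (2, per the ZeRO shards above)
distributed_type: DEEPSPEED
num_processes: 2
```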
I use the same num_processes. And I have solved the problem: the pretrained_models files were broken, so I downloaded them again.
Thank you.
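Since corrupted downloads were the root cause here, it can be worth verifying file integrity before training. A small stdlib sketch that streams a file and computes its SHA-256, to compare against a digest published with the release (the path and expected digest below are placeholders):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file in 1 MiB chunks and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical usage — compare against the digest published for the weights:
# expected = "<digest from the model release page>"
# assert sha256_of("pretrained_models/model.pt") == expected, "corrupted download"
```

A mismatch (or an exception from `torch.load` on the file) is a strong hint the download is truncated or corrupted and should be fetched again.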
Training the model without a checkpoint works fine, but when I load the checkpoint from the last run and continue training, the values predicted by the model are NaN.
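One quick diagnostic for this symptom is to scan the resumed parameters for NaN/Inf before the first step. This sketch works on a flat mapping of name → list of floats for illustration; with real PyTorch tensors you would iterate `model.named_parameters()` and use `torch.isfinite` instead:

```python
import math

def find_nonfinite(flat_params):
    """Return the names whose values contain NaN or Inf.

    flat_params: dict mapping parameter name -> iterable of floats.
    """
    bad = []
    for name, values in flat_params.items():
        if any(not math.isfinite(v) for v in values):
            bad.append(name)
    return bad

# Toy example: one parameter poisoned with NaN, one healthy.
params = {"encoder.weight": [0.1, float("nan")], "head.bias": [0.0, 1.0]}
print(find_nonfinite(params))  # → ['encoder.weight']
```

If any parameter is non-finite right after loading, the checkpoint (or, as in this thread, the pretrained weight files) is already bad and no loss-scale backoff will recover it.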