Tele-AI / TeleChat-52B


LoRA weight conversion fails: NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12046, OpType=_ALLGATHER_BASE, NumelIn=6144, NumelOut=24576, Timeout(ms)=1800000) ran for 1800029 milliseconds before timing out. #8

Closed · mmiumiu closed this issue 1 week ago

mmiumiu commented 1 month ago

Hello, while fine-tuning the 52B model I hit the error below, specifically when converting the LoRA layer parameters during model saving. The code hangs in TeleChat-52B/deepspeed-finetune/utils/module/lora.py -> convert_lora_to_linear_layer -> with deepspeed.zero.GatheredParameters(), using ZeRO-3 + LoRA. The error message is:

epoch:1, global_step:4, step:16, cur_batch_loss: 7.40625
saving step 4 model ...
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12046, OpType=_ALLGATHER_BASE, NumelIn=6144, NumelOut=24576, Timeout(ms)=1800000) ran for 1800029 milliseconds before timing out.
ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6787 [0] NCCL INFO [Service thread] Connection closed by localRank 0
ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6775 [0] NCCL INFO comm 0x91234c0 rank 0 nranks 4 cudaDev 0 busId 3b000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12046, OpType=_ALLGATHER_BASE, NumelIn=6144, NumelOut=24576, Timeout(ms)=1800000) ran for 1800029 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12046, OpType=_ALLGATHER_BASE, NumelIn=6144, NumelOut=24576, Timeout(ms)=1800000) ran for 1800029 milliseconds before timing out.
[2024-08-01 21:43:26,475] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 6566
[2024-08-01 21:43:26,475] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 6567
[2024-08-01 21:43:27,011] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 6568
[2024-08-01 21:43:27,585] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 6569
[2024-08-01 21:43:28,161] [ERROR] [launch.py:321:sigkill_handler] ['/usr/bin/python3', '-u', 'main.py', '--local_rank=3', '--data_path', 'datas/data_files', '--model_name_or_path', '/xxx/TeleChat-52B/models', '--per_device_train_batch_size', '1', '--max_seq_len', '1024', '--with_loss_mask', '--learning_rate', '3e-5', '--weight_decay', '0.0001', '--num_train_epochs', '1', '--gradient_accumulation_steps', '4', '--lr_scheduler_type', 'cosine', '--warmup_proportion', '0.1', '--gradient_checkpointing', '--seed', '42', '--zero_stage', '3', '--offload', '--lora_dim', '2', '--mark_only_lora_as_trainable', '--lora_module_name', 'attn.c_attn', '--save_steps', '4', '--deepspeed', '--output_dir', 'test'] exits with return code = -6

Could you advise on the cause of this failure? Could you also share the environment this code is expected to run in (e.g. torch version, DeepSpeed version)? The open-source image cannot fine-tune the 52B model directly, and it appears to be the same image linked in the 12B open-source release. Also: training and inference of the 12B model work fine on my side.
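One common cause of an _ALLGATHER_BASE timeout at save time under ZeRO-3 is that not every rank reaches the parameter gather: deepspeed.zero.GatheredParameters is a collective, so if the LoRA conversion/save path runs only on rank 0 (or another rank exits or raises earlier), the remaining ranks never join the all-gather and rank 0 blocks until the 30-minute watchdog fires. This is only a hypothesis for this issue; the minimal sketch below is illustrative, not the repository's code, and save_fn is a placeholder:

```python
# Illustrative sketch only (not TeleChat's code): GatheredParameters issues an
# all-gather, so EVERY rank must enter the context. Guarding the whole block with
# a rank-0 check leaves the other ranks outside the collective, and rank 0 then
# hangs in _ALLGATHER_BASE until the NCCL watchdog timeout (30 min by default).
import deepspeed
import torch.distributed as dist

def convert_and_save(model, save_fn):
    rank = dist.get_rank()

    # Problematic pattern:
    #   if rank == 0:
    #       with deepspeed.zero.GatheredParameters(list(model.parameters()), modifier_rank=0):
    #           save_fn(model)

    # Collective-safe pattern: all ranks gather, only rank 0 writes to disk.
    with deepspeed.zero.GatheredParameters(list(model.parameters()), modifier_rank=0):
        if rank == 0:
            save_fn(model)
    dist.barrier()  # re-synchronize before training continues
```

If all ranks do reach convert_lora_to_linear_layer, the same symptom can also come from the gather simply being very slow with CPU offload enabled, in which case raising the timeout (see the sketch after the full training log below) helps tell a real hang from slow progress.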

mmiumiu commented 1 month ago

[TeleChat-52B training log:]

[2024-08-01 21:04:37,901] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-01 21:04:39,118] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1,2,3: setting --include=localhost:0,1,2,3
[2024-08-01 21:04:39,118] [INFO] [runner.py:570:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --data_path datas/data_files --model_name_or_path /mnt/home/Fuzt/TeleChat-52B/models --per_device_train_batch_size 1 --max_seq_len 1024 --with_loss_mask --learning_rate 3e-5 --weight_decay 0.0001 --num_train_epochs 1 --gradient_accumulation_steps 4 --lr_scheduler_type cosine --warmup_proportion 0.1 --gradient_checkpointing --seed 42 --zero_stage 3 --offload --lora_dim 2 --mark_only_lora_as_trainable --lora_module_name attn.c_attn --save_steps 4 --deepspeed --output_dir test
[2024-08-01 21:04:41,497] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-01 21:04:43,139] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=info
[2024-08-01 21:04:43,139] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eth0
[2024-08-01 21:04:43,139] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=1
[2024-08-01 21:04:43,139] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2024-08-01 21:04:43,139] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2024-08-01 21:04:43,139] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2024-08-01 21:04:43,139] [INFO] [launch.py:163:main] dist_world_size=4
[2024-08-01 21:04:43,139] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2024-08-01 21:04:47,086] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-01 21:04:47,087] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-01 21:04:47,090] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-01 21:04:47,090] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-08-01 21:04:48,392] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-01 21:04:49,722] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-01 21:04:49,735] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-01 21:04:49,742] [INFO] [comm.py:637:init_distributed] cdb=None
ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6566 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6566 [0] NCCL INFO Bootstrap : Using eth0:192.169.64.15<0>
ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6566 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6566 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6566 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.6+cuda11.8
ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6567 [1] NCCL INFO cudaDriverVersion 12020
ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6569 [3] NCCL INFO cudaDriverVersion 12020
ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6567 [1] NCCL INFO
NCCL_SOCKET_IFNAME set by environment to eth0 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6567 [1] NCCL INFO Bootstrap : Using eth0:192.169.64.15<0> ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6568 [2] NCCL INFO cudaDriverVersion 12020 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6567 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6567 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6568 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6568 [2] NCCL INFO Bootstrap : Using eth0:192.169.64.15<0> ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6568 [2] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6568 [2] NCCL INFO NET/Plugin : No plugin found, using internal implementation ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6569 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6569 [3] NCCL INFO Bootstrap : Using eth0:192.169.64.15<0> ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6569 [3] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6569 [3] NCCL INFO NET/Plugin : No plugin found, using internal implementation ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6781 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1. ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6781 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6781 [1] NCCL INFO NET/Socket : Using [0]eth0:192.169.64.15<0> ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6781 [1] NCCL INFO Using network Socket ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6782 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 1. ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6782 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6782 [2] NCCL INFO NET/Socket : Using [0]eth0:192.169.64.15<0> ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6782 [2] NCCL INFO Using network Socket ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6783 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 1. ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6783 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6783 [3] NCCL INFO NET/Socket : Using [0]eth0:192.169.64.15<0> ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6783 [3] NCCL INFO Using network Socket ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6780 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1. 
ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6780 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6780 [0] NCCL INFO NET/Socket : Using [0]eth0:192.169.64.15<0> ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6780 [0] NCCL INFO Using network Socket ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6780 [0] NCCL INFO comm 0x91234c0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 3b000 commId 0xba84bfe8b1f80164 - Init START ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6782 [2] NCCL INFO comm 0x910d890 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId bc000 commId 0xba84bfe8b1f80164 - Init START ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6781 [1] NCCL INFO comm 0x8cc3a10 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId bb000 commId 0xba84bfe8b1f80164 - Init START ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6783 [3] NCCL INFO comm 0x8c3b640 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId bd000 commId 0xba84bfe8b1f80164 - Init START ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6780 [0] NCCL INFO Setting affinity for GPU 0 to fe000000,00000000,fe000000 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6781 [1] NCCL INFO Setting affinity for GPU 1 to 07fc,00000000,000007f9,00000000 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6783 [3] NCCL INFO Setting affinity for GPU 3 to 07fc,00000000,000007f9,00000000 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6782 [2] NCCL INFO Setting affinity for GPU 2 to 07fc,00000000,000007f9,00000000 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6782 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6782 [2] NCCL INFO P2P Chunksize set to 131072 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6782 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6782 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6781 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6781 [1] NCCL INFO P2P Chunksize set to 131072 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6780 [0] NCCL INFO Channel 00/02 : 0 1 2 3 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6783 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6780 [0] NCCL INFO Channel 01/02 : 0 1 2 3 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6783 [3] NCCL INFO P2P Chunksize set to 131072 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6780 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6780 [0] NCCL INFO P2P Chunksize set to 131072 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6783 [3] NCCL INFO Channel 00 : 3[3] -> 0[0] via SHM/direct/direct ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6783 [3] NCCL INFO Channel 01 : 3[3] -> 0[0] via SHM/direct/direct ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6781 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6781 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6782 [2] NCCL INFO Connected all rings ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6782 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/IPC ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6782 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6780 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6780 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6781 [1] NCCL 
INFO Connected all rings ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6781 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6780 [0] NCCL INFO Connected all rings ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6781 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6783 [3] NCCL INFO Connected all rings ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6783 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6783 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6783 [3] NCCL INFO Connected all trees ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6783 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6783 [3] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6782 [2] NCCL INFO Connected all trees ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6782 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6782 [2] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6780 [0] NCCL INFO Connected all trees ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6780 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6780 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6781 [1] NCCL INFO Connected all trees ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6781 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6781 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6780 [0] NCCL INFO comm 0x91234c0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 3b000 commId 0xba84bfe8b1f80164 - Init COMPLETE ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:6781 [1] NCCL INFO comm 0x8cc3a10 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId bb000 commId 0xba84bfe8b1f80164 - Init COMPLETE ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:6783 [3] NCCL INFO comm 0x8c3b640 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId bd000 commId 0xba84bfe8b1f80164 - Init COMPLETE ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:6782 [2] NCCL INFO comm 0x910d890 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId bc000 commId 0xba84bfe8b1f80164 - Init COMPLETE

FLASH ATTENTION 2 DETECTED

FLASH ATTENTION 2 DETECTED

FLASH ATTENTION 2 DETECTED

FLASH ATTENTION 2 DETECTED

TELECHAT flash attention enabled TELECHAT flash attention enabled TELECHAT flash attention enabled [2024-08-01 21:05:07,065] [INFO] [partition_parameters.py:347:exit] finished initializing model - num_params = 387, num_elems = 52.83B TELECHAT flash attention enabled

Loading checkpoint shards: 0%| | 0/11 [00:00<?, ?it/s] Loading checkpoint shards: 0%| | 0/11 [00:00<?, ?it/s] Loading checkpoint shards: 0%| | 0/11 [00:00<?, ?it/s] Loading checkpoint shards: 0%| | 0/11 [00:00<?, ?it/s] Loading checkpoint shards: 9%|▉ | 1/11 [00:15<02:31, 15.15s/it] Loading checkpoint shards: 9%|▉ | 1/11 [00:15<02:31, 15.16s/it] Loading checkpoint shards: 9%|▉ | 1/11 [00:15<02:31, 15.18s/it] Loading checkpoint shards: 9%|▉ | 1/11 [00:15<02:34, 15.48s/it] Loading checkpoint shards: 18%|█▊ | 2/11 [00:30<02:15, 15.06s/it] Loading checkpoint shards: 18%|█▊ | 2/11 [00:30<02:15, 15.06s/it] Loading checkpoint shards: 18%|█▊ | 2/11 [00:30<02:15, 15.08s/it] Loading checkpoint shards: 18%|█▊ | 2/11 [00:30<02:17, 15.23s/it] Loading checkpoint shards: 27%|██▋ | 3/11 [00:45<02:00, 15.02s/it] Loading checkpoint shards: 27%|██▋ | 3/11 [00:45<02:00, 15.03s/it] Loading checkpoint shards: 27%|██▋ | 3/11 [00:45<02:00, 15.02s/it] Loading checkpoint shards: 27%|██▋ | 3/11 [00:45<02:01, 15.13s/it] Loading checkpoint shards: 36%|███▋ | 4/11 [01:00<01:44, 15.00s/it] Loading checkpoint shards: 36%|███▋ | 4/11 [01:00<01:45, 15.00s/it] Loading checkpoint shards: 36%|███▋ | 4/11 [01:00<01:45, 15.02s/it] Loading checkpoint shards: 36%|███▋ | 4/11 [01:00<01:45, 15.05s/it] Loading checkpoint shards: 45%|████▌ | 5/11 [01:15<01:30, 15.03s/it] Loading checkpoint shards: 45%|████▌ | 5/11 [01:15<01:30, 15.03s/it] Loading checkpoint shards: 45%|████▌ | 5/11 [01:15<01:30, 15.04s/it] Loading checkpoint shards: 45%|████▌ | 5/11 [01:15<01:30, 15.06s/it] Loading checkpoint shards: 55%|█████▍ | 6/11 [01:29<01:14, 14.91s/it] Loading checkpoint shards: 55%|█████▍ | 6/11 [01:29<01:14, 14.90s/it] Loading checkpoint shards: 55%|█████▍ | 6/11 [01:29<01:14, 14.91s/it] Loading checkpoint shards: 55%|█████▍ | 6/11 [01:30<01:14, 14.89s/it] Loading checkpoint shards: 64%|██████▎ | 7/11 [01:44<00:59, 14.96s/it] Loading checkpoint shards: 64%|██████▎ | 7/11 [01:44<00:59, 14.96s/it] Loading checkpoint shards: 64%|██████▎ | 7/11 [01:44<00:59, 14.96s/it] Loading checkpoint shards: 64%|██████▎ | 7/11 [01:45<01:00, 15.01s/it] Loading checkpoint shards: 73%|███████▎ | 8/11 [01:59<00:44, 14.96s/it] Loading checkpoint shards: 73%|███████▎ | 8/11 [01:59<00:44, 14.97s/it] Loading checkpoint shards: 73%|███████▎ | 8/11 [01:59<00:44, 14.97s/it] Loading checkpoint shards: 73%|███████▎ | 8/11 [02:00<00:44, 14.99s/it] Loading checkpoint shards: 82%|████████▏ | 9/11 [02:14<00:30, 15.00s/it] Loading checkpoint shards: 82%|████████▏ | 9/11 [02:14<00:30, 15.01s/it] Loading checkpoint shards: 82%|████████▏ | 9/11 [02:14<00:30, 15.00s/it] Loading checkpoint shards: 82%|████████▏ | 9/11 [02:15<00:30, 15.01s/it] Loading checkpoint shards: 91%|█████████ | 10/11 [02:29<00:15, 15.00s/it] Loading checkpoint shards: 91%|█████████ | 10/11 [02:29<00:15, 15.01s/it] Loading checkpoint shards: 91%|█████████ | 10/11 [02:29<00:15, 15.00s/it] Loading checkpoint shards: 91%|█████████ | 10/11 [02:30<00:15, 15.01s/it] Loading checkpoint shards: 100%|██████████| 11/11 [02:42<00:00, 14.32s/it] Loading checkpoint shards: 100%|██████████| 11/11 [02:42<00:00, 14.79s/it] Some weights of TELECHAT were not initialized from the model checkpoint at /mnt/home/Fuzt/TeleChat-52B/models and are newly initialized: ['transformer.wpe.inv_freq'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Loading checkpoint shards: 100%|██████████| 11/11 [02:42<00:00, 14.33s/it] Loading checkpoint shards: 100%|██████████| 11/11 [02:42<00:00, 14.80s/it] Some weights of TELECHAT were not initialized from the model checkpoint at /mnt/home/Fuzt/TeleChat-52B/models and are newly initialized: ['transformer.wpe.inv_freq'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Loading checkpoint shards: 100%|██████████| 11/11 [02:42<00:00, 14.33s/it] Loading checkpoint shards: 100%|██████████| 11/11 [02:42<00:00, 14.80s/it] Some weights of TELECHAT were not initialized from the model checkpoint at /mnt/home/Fuzt/TeleChat-52B/models and are newly initialized: ['transformer.wpe.inv_freq'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. train_fname:datas/data_files train_fname:datas/data_files train_fname:datas/data_files

Loading checkpoint shards: 100%|██████████| 11/11 [02:43<00:00, 14.30s/it] Loading checkpoint shards: 100%|██████████| 11/11 [02:43<00:00, 14.82s/it] Some weights of TELECHAT were not initialized from the model checkpoint at /mnt/home/Fuzt/TeleChat-52B/models and are newly initialized: ['transformer.wpe.inv_freq'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. train_fname:datas/data_files Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/torch_extensions/py310_cu118/cpu_adam/build.ninja... Building extension module cpu_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module cpu_adam... Time to load cpu_adam op: 2.396916151046753 seconds Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/torch_extensions/py310_cu118/cpu_adam/build.ninja... Building extension module cpu_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module cpu_adam... Time to load cpu_adam op: 2.4172143936157227 seconds Loading extension module cpu_adam... Time to load cpu_adam op: 2.4470088481903076 seconds Loading extension module cpu_adam... Time to load cpu_adam op: 2.5383830070495605 seconds Adam Optimizer #0 is created with AVX512 arithmetic capability. Config: alpha=0.000030, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1 Adam Optimizer #0 is created with AVX512 arithmetic capability. Config: alpha=0.000030, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1 [2024-08-01 21:07:56,473] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.11.0, git-hash=unknown, git-branch=unknown [2024-08-01 21:07:56,473] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized Adam Optimizer #0 is created with AVX512 arithmetic capability. Config: alpha=0.000030, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1 [2024-08-01 21:07:56,491] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False [2024-08-01 21:07:56,495] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer [2024-08-01 21:07:56,495] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer [2024-08-01 21:07:56,503] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam [2024-08-01 21:07:56,503] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'> [2024-08-01 21:07:56,503] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False [2024-08-01 21:07:56,503] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer Adam Optimizer #0 is created with AVX512 arithmetic capability. 
Config: alpha=0.000030, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1 [2024-08-01 21:07:56,652] [INFO] [utils.py:802:see_memory_usage] Stage 3 initialize beginning [2024-08-01 21:07:56,652] [INFO] [utils.py:803:see_memory_usage] MA 1.0 GB Max_MA 3.47 GB CA 5.61 GB Max_CA 6 GB [2024-08-01 21:07:56,653] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 140.04 GB, percent = 13.9% [2024-08-01 21:07:56,655] [INFO] [stage3.py:126:init] Reduce bucket size 500,000,000 [2024-08-01 21:07:56,655] [INFO] [stage3.py:127:init] Prefetch bucket size 30000000 [2024-08-01 21:07:56,777] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [begin] [2024-08-01 21:07:56,777] [INFO] [utils.py:803:see_memory_usage] MA 1.0 GB Max_MA 1.0 GB CA 5.61 GB Max_CA 6 GB [2024-08-01 21:07:56,778] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 140.04 GB, percent = 13.9% Parameter Offload: Total persistent parameters: 1581056 in 193 params [2024-08-01 21:07:56,973] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [end] [2024-08-01 21:07:56,974] [INFO] [utils.py:803:see_memory_usage] MA 1.0 GB Max_MA 1.0 GB CA 5.61 GB Max_CA 6 GB [2024-08-01 21:07:56,974] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 140.05 GB, percent = 13.9% [2024-08-01 21:07:57,106] [INFO] [utils.py:802:see_memory_usage] Before creating fp16 partitions [2024-08-01 21:07:57,107] [INFO] [utils.py:803:see_memory_usage] MA 1.0 GB Max_MA 1.0 GB CA 5.61 GB Max_CA 6 GB [2024-08-01 21:07:57,107] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 140.05 GB, percent = 13.9% [2024-08-01 21:07:57,331] [INFO] [utils.py:802:see_memory_usage] After creating fp16 partitions: 1 [2024-08-01 21:07:57,331] [INFO] [utils.py:803:see_memory_usage] MA 1.0 GB Max_MA 1.0 GB CA 5.61 GB Max_CA 6 GB [2024-08-01 21:07:57,332] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 140.08 GB, percent = 13.9% [2024-08-01 21:07:57,461] [INFO] [utils.py:802:see_memory_usage] Before creating fp32 partitions [2024-08-01 21:07:57,462] [INFO] [utils.py:803:see_memory_usage] MA 1.0 GB Max_MA 1.0 GB CA 5.61 GB Max_CA 6 GB [2024-08-01 21:07:57,462] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 140.08 GB, percent = 13.9% [2024-08-01 21:07:57,613] [INFO] [utils.py:802:see_memory_usage] After creating fp32 partitions [2024-08-01 21:07:57,614] [INFO] [utils.py:803:see_memory_usage] MA 1.0 GB Max_MA 1.0 GB CA 5.61 GB Max_CA 6 GB [2024-08-01 21:07:57,614] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 140.08 GB, percent = 13.9% [2024-08-01 21:07:57,827] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states [2024-08-01 21:07:57,828] [INFO] [utils.py:803:see_memory_usage] MA 1.0 GB Max_MA 1.0 GB CA 5.61 GB Max_CA 6 GB [2024-08-01 21:07:57,828] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 140.11 GB, percent = 13.9% [2024-08-01 21:07:58,021] [INFO] [utils.py:802:see_memory_usage] After initializing optimizer states [2024-08-01 21:07:58,021] [INFO] [utils.py:803:see_memory_usage] MA 1.0 GB Max_MA 1.0 GB CA 5.61 GB Max_CA 6 GB [2024-08-01 21:07:58,021] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 140.12 GB, percent = 13.9% [2024-08-01 21:07:58,022] [INFO] [stage3.py:459:_setup_for_real_optimizer] optimizer state initialized You are using an old version of the checkpointing format that is deprecated (We will also silently ignore gradient_checkpointing_kwargs in case you passed 
it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model. You are using an old version of the checkpointing format that is deprecated (We will also silently ignore gradient_checkpointing_kwargs in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model. You are using an old version of the checkpointing format that is deprecated (We will also silently ignore gradient_checkpointing_kwargs in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model. [2024-08-01 21:07:58,303] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer [2024-08-01 21:07:58,304] [INFO] [utils.py:803:see_memory_usage] MA 1.93 GB Max_MA 1.93 GB CA 5.61 GB Max_CA 6 GB [2024-08-01 21:07:58,304] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 140.29 GB, percent = 13.9% [2024-08-01 21:07:58,304] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedCPUAdam [2024-08-01 21:07:58,305] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler [2024-08-01 21:07:58,305] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f3834c03a30> [2024-08-01 21:07:58,305] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.95)] [2024-08-01 21:07:58,307] [INFO] [config.py:968:print] DeepSpeedEngine configuration: [2024-08-01 21:07:58,307] [INFO] [config.py:972:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2024-08-01 21:07:58,307] [INFO] [config.py:972:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2024-08-01 21:07:58,307] [INFO] [config.py:972:print] amp_enabled .................. False [2024-08-01 21:07:58,307] [INFO] [config.py:972:print] amp_params ................... False [2024-08-01 21:07:58,307] [INFO] [config.py:972:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2024-08-01 21:07:58,307] [INFO] [config.py:972:print] bfloat16_enabled ............. 
True [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] checkpoint_parallel_write_pipeline False [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] checkpoint_tag_validation_enabled True [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] checkpoint_tag_validation_fail False [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f383547b0a0> [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] communication_data_type ...... None [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] curriculum_enabled_legacy .... False [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] curriculum_params_legacy ..... False [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] data_efficiency_enabled ...... False [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] dataloader_drop_last ......... False [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] disable_allgather ............ False [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] dump_state ................... False [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] dynamic_loss_scale_args ...... None [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] eigenvalue_enabled ........... False [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] eigenvalue_gas_boundary_resolution 1 [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] eigenvalue_layer_name ........ bert.encoder.layer [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] eigenvalue_layer_num ......... 0 [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] eigenvalue_max_iter .......... 100 [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] eigenvalue_stability ......... 1e-06 [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] eigenvalue_tol ............... 0.01 [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] eigenvalue_verbose ........... False [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] elasticity_enabled ........... 
False [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] fp16_auto_cast ............... None [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] fp16_enabled ................. False [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] fp16_master_weights_and_gradients False [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] global_rank .................. 0 [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] grad_accum_dtype ............. None [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] gradient_accumulation_steps .. 4 [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] gradient_clipping ............ 1.0 [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] gradient_predivide_factor .... 1.0 [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] initial_dynamic_scale ........ 1 [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] load_universal_checkpoint .... False [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] loss_scale ................... 1.0 [2024-08-01 21:07:58,308] [INFO] [config.py:972:print] memory_breakdown ............. False [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] mics_hierarchial_params_gather False [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] mics_shard_size .............. -1 [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] optimizer_legacy_fusion ...... False [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] optimizer_name ............... None [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] optimizer_params ............. None [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] pld_enabled .................. False [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] pld_params ................... False [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] prescale_gradients ........... False [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] scheduler_name ............... None [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] scheduler_params ............. None [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] sparse_attention ............. None [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] sparse_gradients_enabled ..... False [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] steps_per_print .............. 
1 [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] train_batch_size ............. 16 [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] train_micro_batch_size_per_gpu 1 [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] use_node_local_storage ....... True [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] wall_clock_breakdown ......... False [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] weight_quantization_config ... None [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] world_size ................... 4 [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] zero_allow_untested_optimizer False [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='cpu', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=30000000 param_persistence_threshold=10000 model_persistence_threshold=sys.maxsize max_live_parameters=30000000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] zero_enabled ................. True [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] zero_force_ds_cpu_optimizer .. True [2024-08-01 21:07:58,309] [INFO] [config.py:972:print] zero_optimization_stage ...... 3 [2024-08-01 21:07:58,309] [INFO] [config.py:958:print_user_config] json = { "bf16": { "enabled": true }, "train_batch_size": 16, "train_micro_batch_size_per_gpu": 1, "steps_per_print": 1, "zero_optimization": { "stage": 3, "offload_param": { "device": "cpu" }, "offload_optimizer": { "device": "cpu" }, "stage3_param_persistence_threshold": 1.000000e+04, "stage3_max_live_parameters": 3.000000e+07, "stage3_prefetch_bucket_size": 3.000000e+07 }, "gradient_clipping": 1.0, "prescale_gradients": false, "wall_clock_breakdown": false, "checkpoint": { "use_node_local_storage": true } } You are using an old version of the checkpointing format that is deprecated (We will also silently ignore gradient_checkpointing_kwargs in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method _set_gradient_checkpointing in your model. 
Running training Beginning of Epoch 1/1, Total Micro Batches 2500 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:8108 [0] NCCL INFO Using network Socket ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:8109 [1] NCCL INFO Using network Socket ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:8110 [3] NCCL INFO Using network Socket ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:8111 [2] NCCL INFO Using network Socket ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:8108 [0] NCCL INFO comm 0x2a793100 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 3b000 commId 0x2c56ba17ca88e7e5 - Init START ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:8110 [3] NCCL INFO comm 0x2a2a2e00 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId bd000 commId 0x2c56ba17ca88e7e5 - Init START ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:8111 [2] NCCL INFO comm 0x2a7741c0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId bc000 commId 0x2c56ba17ca88e7e5 - Init START ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:8109 [1] NCCL INFO comm 0x2a327300 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId bb000 commId 0x2c56ba17ca88e7e5 - Init START ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:8109 [1] NCCL INFO Setting affinity for GPU 1 to 07fc,00000000,000007f9,00000000 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:8110 [3] NCCL INFO Setting affinity for GPU 3 to 07fc,00000000,000007f9,00000000 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:8111 [2] NCCL INFO Setting affinity for GPU 2 to 07fc,00000000,000007f9,00000000 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:8108 [0] NCCL INFO Setting affinity for GPU 0 to fe000000,00000000,fe000000 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:8111 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:8111 [2] NCCL INFO P2P Chunksize set to 131072 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:8110 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:8110 [3] NCCL INFO P2P Chunksize set to 131072 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:8109 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:8109 [1] NCCL INFO P2P Chunksize set to 131072 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:8108 [0] NCCL INFO Channel 00/02 : 0 1 2 3 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:8108 [0] NCCL INFO Channel 01/02 : 0 1 2 3 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:8108 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:8108 [0] NCCL INFO P2P Chunksize set to 131072 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:8111 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/IPC ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:8111 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/IPC ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:8110 [3] NCCL INFO Channel 00 : 3[3] -> 0[0] via SHM/direct/direct ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:8110 [3] NCCL INFO Channel 01 : 3[3] -> 0[0] via SHM/direct/direct ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:8109 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/IPC ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:8109 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/IPC ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:8108 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:8108 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:8111 [2] NCCL INFO Connected all rings ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:8109 [1] NCCL INFO Connected all rings ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:8111 [2] NCCL INFO Channel 00/0 
: 2[2] -> 1[1] via P2P/IPC ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:8111 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/IPC ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:8109 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:8109 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:8110 [3] NCCL INFO Connected all rings ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:8110 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/IPC ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:8108 [0] NCCL INFO Connected all rings ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:8110 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/IPC ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:8110 [3] NCCL INFO Connected all trees ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:8110 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:8110 [3] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:8111 [2] NCCL INFO Connected all trees ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:8111 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:8111 [2] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:8108 [0] NCCL INFO Connected all trees ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:8108 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:8108 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:8109 [1] NCCL INFO Connected all trees ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:8109 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:8109 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer ctmt240730075851vo9-6bd4bc4f59-kxc6g:6567:8109 [1] NCCL INFO comm 0x2a327300 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId bb000 commId 0x2c56ba17ca88e7e5 - Init COMPLETE ctmt240730075851vo9-6bd4bc4f59-kxc6g:6569:8110 [3] NCCL INFO comm 0x2a2a2e00 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId bd000 commId 0x2c56ba17ca88e7e5 - Init COMPLETE ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:8108 [0] NCCL INFO comm 0x2a793100 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 3b000 commId 0x2c56ba17ca88e7e5 - Init COMPLETE ctmt240730075851vo9-6bd4bc4f59-kxc6g:6568:8111 [2] NCCL INFO comm 0x2a7741c0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId bc000 commId 0x2c56ba17ca88e7e5 - Init COMPLETE /usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py:1285: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.) total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)]) /usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py:1285: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.) 
total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)]) /usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py:1285: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.) total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)]) /usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py:1285: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.) total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)]) [2024-08-01 21:09:27,839] [INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=0, lr=[4.838709677419355e-07], mom=[(0.9, 0.95)] epoch:1, global_step:1, step:4, cur_batch_loss: 7.125 [2024-08-01 21:10:46,754] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[9.67741935483871e-07], mom=[(0.9, 0.95)] epoch:1, global_step:2, step:8, cur_batch_loss: 7.375 [2024-08-01 21:12:05,732] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[1.4516129032258064e-06], mom=[(0.9, 0.95)] [2024-08-01 21:12:05,733] [INFO] [timer.py:260:stop] epoch=0/micro_step=12/global_step=3, RunningAvgSamplesPerSec=0.20261741813227066, CurrSamplesPerSec=0.20261741813227066, MemAllocated=2.35GB, MaxMemAllocated=6.14GB epoch:1, global_step:3, step:12, cur_batch_loss: 7.5 [2024-08-01 21:13:24,719] [INFO] [logging.py:96:log_dist] [Rank 0] step=4, skipped=0, lr=[1.935483870967742e-06], mom=[(0.9, 0.95)] [2024-08-01 21:13:24,720] [INFO] [timer.py:260:stop] epoch=0/micro_step=16/global_step=4, RunningAvgSamplesPerSec=0.2026074253901565, CurrSamplesPerSec=0.2025974336336434, MemAllocated=2.35GB, MaxMemAllocated=6.14GB epoch:1, global_step:4, step:16, cur_batch_loss: 7.40625 saving step 4 model ... [E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12046, OpType=_ALLGATHER_BASE, NumelIn=6144, NumelOut=24576, Timeout(ms)=1800000) ran for 1800029 milliseconds before timing out. ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6787 [0] NCCL INFO [Service thread] Connection closed by localRank 0 ctmt240730075851vo9-6bd4bc4f59-kxc6g:6566:6775 [0] NCCL INFO comm 0x91234c0 rank 0 nranks 4 cudaDev 0 busId 3b000 - Abort COMPLETE [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down. [E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12046, OpType=_ALLGATHER_BASE, NumelIn=6144, NumelOut=24576, Timeout(ms)=1800000) ran for 1800029 milliseconds before timing out. terminate called after throwing an instance of 'std::runtime_error' what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12046, OpType=_ALLGATHER_BASE, NumelIn=6144, NumelOut=24576, Timeout(ms)=1800000) ran for 1800029 milliseconds before timing out. 
[2024-08-01 21:43:26,475] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 6566 [2024-08-01 21:43:26,475] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 6567 [2024-08-01 21:43:27,011] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 6568 [2024-08-01 21:43:27,585] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 6569 [2024-08-01 21:43:28,161] [ERROR] [launch.py:321:sigkill_handler] ['/usr/bin/python3', '-u', 'main.py', '--local_rank=3', '--data_path', 'datas/data_files', '--model_name_or_path', '/mnt/home/Fuzt/TeleChat-52B/models', '--per_device_train_batch_size', '1', '--max_seq_len', '1024', '--with_loss_mask', '--learning_rate', '3e-5', '--weight_decay', '0.0001', '--num_train_epochs', '1', '--gradient_accumulation_steps', '4', '--lr_scheduler_type', 'cosine', '--warmup_proportion', '0.1', '--gradient_checkpointing', '--seed', '42', '--zero_stage', '3', '--offload', '--lora_dim', '2', '--mark_only_lora_as_trainable', '--lora_module_name', 'attn.c_attn', '--save_steps', '4', '--deepspeed', '--output_dir', 'test'] exits with return code = -6
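For what it's worth, the 1800000 ms in the trace is just the default process-group timeout, not a property of the model. As a diagnostic (my own suggestion, not something the TeleChat scripts expose as an option), the timeout can be raised where deepspeed.init_distributed is called, which at least distinguishes a genuinely stuck collective from a gather that is merely very slow under ZeRO-3 CPU offload:

```python
# Diagnostic only: lift the default 30-minute collective timeout so a slow but
# still-progressing gather is not killed by the NCCL watchdog. This does not fix
# a real hang; it only buys time to see whether the save eventually completes.
from datetime import timedelta
import deepspeed

deepspeed.init_distributed(dist_backend="nccl", timeout=timedelta(hours=3))
```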

[pip_list] Package Version


absl-py 2.0.0 accelerate 0.26.1 aiohttp 3.9.1 aiosignal 1.3.1 altair 5.3.0 anyio 4.4.0 apex 0.1 async-timeout 4.0.3 attrs 23.2.0 bcrypt 4.1.2 bfloat16 1.1 blinker 1.7.0 cachetools 5.3.2 certifi 2023.11.17 cffi 1.16.0 cfgv 3.4.0 charset-normalizer 2.1.1 click 8.1.7 colossalai 0.3.3 contexttimer 0.3.3 cryptography 41.0.7 datasets 2.14.6 decorator 5.1.1 deepspeed 0.11.0 Deprecated 1.2.14 dill 0.3.7 distlib 0.3.8 dnspython 2.6.1 einops 0.7.0 email_validator 2.2.0 exceptiongroup 1.2.2 fabric 3.2.2 fastapi 0.111.1 fastapi-cli 0.0.4 filelock 3.13.1 flash-attn 2.5.0 Flask 3.0.0 frozenlist 1.4.1 fsspec 2023.10.0 gitdb 4.0.11 GitPython 3.1.43 google-auth 2.26.2 google-auth-oauthlib 1.2.0 grpcio 1.60.0 h11 0.14.0 hjson 3.1.0 httpcore 1.0.5 httptools 0.6.1 httpx 0.27.0 huggingface-hub 0.23.4 identify 2.5.33 idna 3.4 invoke 2.2.0 itsdangerous 2.1.2 Jinja2 3.1.2 joblib 1.3.2 jsonschema 4.23.0 jsonschema-specifications 2023.12.1 loralib 0.1.2 Markdown 3.5.2 markdown-it-py 3.0.0 MarkupSafe 2.1.3 mdurl 0.1.2 mpi4py 3.1.6 mpmath 1.3.0 multidict 6.0.4 multiprocess 0.70.15 networkx 3.0 ninja 1.11.1.1 nltk 3.7 nodeenv 1.8.0 numpy 1.24.4 oauthlib 3.2.2 packaging 23.2 pandas 2.1.4 paramiko 3.4.0 peft 0.5.0 Pillow 10.1.0 pip 23.3.1 platformdirs 4.1.0 pre-commit 3.6.0 protobuf 3.20.3 psutil 5.9.7 py-cpuinfo 9.0.0 pyarrow 14.0.2 pyasn1 0.5.1 pyasn1-modules 0.3.0 pybind11 2.11.1 pycparser 2.21 pydantic 1.10.13 pydeck 0.9.1 Pygments 2.17.2 PyNaCl 1.5.0 python-dateutil 2.8.2 python-dotenv 1.0.1 python-multipart 0.0.9 pytz 2023.3.post1 PyYAML 6.0.1 referencing 0.35.1 regex 2023.12.25 requests 2.28.1 requests-oauthlib 1.3.1 rich 13.7.0 rouge 1.0.1 rpds-py 0.19.1 rsa 4.9 safetensors 0.4.1 sentencepiece 0.1.99 setuptools 69.0.2 shellingham 1.5.4 six 1.16.0 smmap 5.0.1 sniffio 1.3.1 starlette 0.37.2 streamlit 1.37.0 sympy 1.12 tenacity 8.5.0 tensorboard 2.15.0 tensorboard-data-server 0.7.2 tiktoken 0.5.2 tokenizers 0.19.1 toml 0.10.2 toolz 0.12.1 torch 2.1.1+cu118 torchaudio 2.1.1+cu118 torchvision 0.16.1+cu118 tornado 6.4.1 tqdm 4.66.1 transformers 4.40.2 transformers-stream-generator 0.0.4 triton 2.1.0 typer 0.12.3 typing_extensions 4.8.0 tzdata 2023.4 urllib3 1.26.13 uvicorn 0.30.3 uvloop 0.19.0 virtualenv 20.25.0 watchdog 4.0.1 watchfiles 0.22.0 websockets 12.0 Werkzeug 3.0.1 wheel 0.42.0 wrapt 1.16.0 xformers 0.0.23+cu118 xxhash 3.4.1 yarl 1.9.4
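For cross-checking against whatever environment the maintainers recommend, the versions that matter here (torch, DeepSpeed, transformers, NCCL) can be printed directly; the snippet below just reads the versions already shown in the pip list above and assumes no particular required version:

```python
# Print the runtime versions relevant to this report (comments show what the
# pip list above reports; nothing new is assumed).
import torch
import deepspeed
import transformers

print("torch       :", torch.__version__)           # 2.1.1+cu118
print("deepspeed   :", deepspeed.__version__)        # 0.11.0
print("transformers:", transformers.__version__)     # 4.40.2
print("CUDA ok     :", torch.cuda.is_available(), "| NCCL:", torch.cuda.nccl.version())
```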

[nvidia_smi]

Captured while waiting on the model save step:

Thu Aug 1 21:31:47 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12     CUDA Version: 12.2    |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                Persistence-M  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf          Pwr:Usage/Cap  |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                    On   | 00000000:3B:00.0 Off |                    0 |
|  0%   43C   P0             88W / 300W   |  13602MiB / 46068MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A40                    On   | 00000000:BB:00.0 Off |                  N/A |
|  0%   40C   P0             86W / 300W   |  12382MiB / 46068MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A40                    On   | 00000000:BC:00.0 Off |                  N/A |
|  0%   43C   P0             91W / 300W   |  13068MiB / 46068MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A40                    On   | 00000000:BD:00.0 Off |                  N/A |
|  0%   43C   P0             89W / 300W   |  13704MiB / 46068MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

LSX-Sneakerprogrammer commented 1 week ago

Fixed.