bltcn opened this issue 1 year ago
Environment: NVIDIA A10 (24 GB VRAM); Docker image: nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04; CPU: 2× Intel® Xeon® Silver 4314; RAM: 256 GB. The log follows:

```
NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 deepspeed --master_port 16666 --hostfile hostfile_single finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --layer_range 0 14 --pre_seq_len 4 --train-data ./fewshot-data/dataset-verify.json --valid-data ./fewshot-data/dataset-verify.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 4 --skip-init --fp16 --use_lora
[2023-09-11 14:56:20,676] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-09-11 14:56:21,855] [WARNING] [runner.py:201:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. [2023-09-11 14:56:24,579] [INFO] [runner.py:567:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=16666 --enable_each_rank_log=None finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --layer_range 0 14 --pre_seq_len 4 --train-data ./fewshot-data/dataset-verify.json --valid-data ./fewshot-data/dataset-verify.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 4 --skip-init --fp16 --use_lora [2023-09-11 14:56:26,254] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-09-11 14:56:27,399] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=0 [2023-09-11 14:56:27,399] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=info [2023-09-11 14:56:27,399] [INFO] [launch.py:138:main] 0 NCCL_NET_GDR_LEVEL=2 [2023-09-11 14:56:27,399] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.16.2-1+cuda11.8 [2023-09-11 14:56:27,399] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.16.2-1 [2023-09-11 14:56:27,399] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.16.2-1 [2023-09-11 14:56:27,399] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev [2023-09-11 14:56:27,399] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.16.2-1+cuda11.8 [2023-09-11 14:56:27,399] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2 [2023-09-11 14:56:27,399] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.16.2-1 [2023-09-11 14:56:27,399] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]} [2023-09-11 14:56:27,399] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0 [2023-09-11 14:56:27,399] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]}) [2023-09-11 14:56:27,399] [INFO] [launch.py:163:main] dist_world_size=1 [2023-09-11 14:56:27,399] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0 [2023-09-11 14:56:29,112] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-09-11 14:56:31,540] [INFO] using world
size: 1 and model-parallel size: 1 [2023-09-11 14:56:31,540] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128) [2023-09-11 14:56:31,541] [INFO] [RANK 0] > initializing model parallel with size 1 [2023-09-11 14:56:31,542] [INFO] [comm.py:631:init_distributed] cdb=None [2023-09-11 14:56:31,543] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead [2023-09-11 14:56:31,543] [INFO] [checkpointing.py:764:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False} [2023-09-11 14:56:31,543] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234 [2023-09-11 14:56:31,544] [INFO] [RANK 0] building FineTuneVisualGLMModel model ... /usr/local/lib/python3.8/dist-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op warnings.warn("Initializing zero-element tensors is a no-op") [2023-09-11 14:56:41,514] [INFO] [RANK 0] replacing layer 0 attention with lora [2023-09-11 14:56:41,931] [INFO] [RANK 0] replacing layer 14 attention with lora [2023-09-11 14:56:42,349] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7802848768 [2023-09-11 14:56:42,798] [INFO] [RANK 0] global rank 0 is loading checkpoint /root/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt [2023-09-11 14:56:51,081] [INFO] [RANK 0] Will continue but found unexpected_keys! Check whether you are loading correct checkpoints: ['transformer.position_embeddings.weight']. [2023-09-11 14:56:51,081] [INFO] [RANK 0] > successfully loaded /root/.sat_models/visualglm-6b/1/mp_rank_00_model_states.pt [2023-09-11 14:56:54,881] [INFO] [RANK 0] Try to load tokenizer from Huggingface transformers... [2023-09-11 14:57:35,115] [INFO] [RANK 0] > Set tokenizer as a THUDM/chatglm-6b tokenizer! Now you can get_tokenizer() everywhere. 71f453ed31a9:6101:6101 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0> 71f453ed31a9:6101:6101 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation 71f453ed31a9:6101:6101 [0] NCCL INFO cudaDriverVersion 11080 NCCL version 2.14.3+cuda11.7 71f453ed31a9:6101:6320 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0. 
71f453ed31a9:6101:6320 [0] NCCL INFO Failed to open libibverbs.so[.1] 71f453ed31a9:6101:6320 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0> 71f453ed31a9:6101:6320 [0] NCCL INFO Using network Socket 71f453ed31a9:6101:6320 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 00/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 01/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 02/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 03/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 04/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 05/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 06/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 07/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 08/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 09/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 10/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 11/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 12/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 13/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 14/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 15/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 16/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 17/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 18/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 19/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 20/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 21/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 22/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 23/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 24/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 25/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 26/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 27/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 28/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 29/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 30/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Channel 31/32 : 0 71f453ed31a9:6101:6320 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1 71f453ed31a9:6101:6320 [0] NCCL INFO Connected all rings 71f453ed31a9:6101:6320 [0] NCCL INFO Connected all trees 71f453ed31a9:6101:6320 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer 71f453ed31a9:6101:6320 [0] NCCL INFO comm 0x137af9cf0 rank 0 nranks 1 cudaDev 0 busId 31000 - Init COMPLETE transformer.layers.0.attention.query_key_value.matrix_A.0 transformer.layers.0.attention.query_key_value.matrix_A.1 transformer.layers.0.attention.query_key_value.matrix_A.2 transformer.layers.0.attention.query_key_value.matrix_B.0 transformer.layers.0.attention.query_key_value.matrix_B.1 transformer.layers.0.attention.query_key_value.matrix_B.2 transformer.layers.0.attention.dense.matrix_A.0 transformer.layers.0.attention.dense.matrix_B.0 
transformer.layers.14.attention.query_key_value.matrix_A.0 transformer.layers.14.attention.query_key_value.matrix_A.1 transformer.layers.14.attention.query_key_value.matrix_A.2 transformer.layers.14.attention.query_key_value.matrix_B.0 transformer.layers.14.attention.query_key_value.matrix_B.1 transformer.layers.14.attention.query_key_value.matrix_B.2 transformer.layers.14.attention.dense.matrix_A.0 transformer.layers.14.attention.dense.matrix_B.0 [2023-09-11 15:08:38,379] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.10.1, git-hash=unknown, git-branch=unknown [2023-09-11 15:08:38,379] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead 71f453ed31a9:6101:6324 [0] NCCL INFO Using network Socket 71f453ed31a9:6101:6324 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 00/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 01/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 02/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 03/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 04/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 05/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 06/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 07/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 08/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 09/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 10/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 11/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 12/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 13/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 14/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 15/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 16/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 17/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 18/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 19/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 20/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 21/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 22/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 23/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 24/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 25/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 26/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 27/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 28/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 29/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 30/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Channel 31/32 : 0 71f453ed31a9:6101:6324 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1 71f453ed31a9:6101:6324 [0] NCCL INFO Connected all rings 71f453ed31a9:6101:6324 [0] NCCL INFO Connected all trees 71f453ed31a9:6101:6324 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels 
per peer 71f453ed31a9:6101:6324 [0] NCCL INFO comm 0xacd2fd20 rank 0 nranks 1 cudaDev 0 busId 31000 - Init COMPLETE [2023-09-11 15:08:38,459] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Installed CUDA version 11.8 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.5830075740814209 seconds [2023-09-11 15:08:39,731] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adam as basic optimizer [2023-09-11 15:08:39,736] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam [2023-09-11 15:08:39,737] [INFO] [utils.py:54:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'> [2023-09-11 15:08:39,737] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 1 optimizer [2023-09-11 15:08:39,737] [INFO] [stage_1_and_2.py:146:init] Reduce bucket size 40000000 [2023-09-11 15:08:39,737] [INFO] [stage_1_and_2.py:147:init] Allgather bucket size 100000000 [2023-09-11 15:08:39,737] [INFO] [stage_1_and_2.py:148:init] CPU Offload: False [2023-09-11 15:08:39,737] [INFO] [stage_1_and_2.py:149:init] Round robin gradient partitioning: False Rank: 0 partition count [1] and sizes[(655360, False)] [2023-09-11 15:08:42,664] [INFO] [utils.py:803:see_memory_usage] Before initializing optimizer states [2023-09-11 15:08:42,665] [INFO] [utils.py:804:see_memory_usage] MA 14.56 GB Max_MA 14.56 GB CA 14.68 GB Max_CA 15 GB [2023-09-11 15:08:42,665] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 11.71 GB, percent = 4.7% [2023-09-11 15:08:45,126] [INFO] [utils.py:803:see_memory_usage] After initializing optimizer states [2023-09-11 15:08:45,127] [INFO] [utils.py:804:see_memory_usage] MA 14.56 GB Max_MA 14.57 GB CA 14.68 GB Max_CA 15 GB [2023-09-11 15:08:45,127] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 11.71 GB, percent = 4.7% [2023-09-11 15:08:45,127] [INFO] [stage_1_and_2.py:520:init] optimizer state initialized [2023-09-11 15:08:47,864] [INFO] [utils.py:803:see_memory_usage] After initializing ZeRO optimizer [2023-09-11 15:08:47,865] [INFO] [utils.py:804:see_memory_usage] MA 14.56 GB Max_MA 14.56 GB CA 14.68 GB Max_CA 15 GB [2023-09-11 15:08:47,865] [INFO] [utils.py:811:see_memory_usage] CPU Virtual Memory: used = 11.71 GB, percent = 4.7% [2023-09-11 15:08:47,866] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adam [2023-09-11 15:08:47,866] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler [2023-09-11 15:08:47,866] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None [2023-09-11 15:08:47,866] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0001], mom=[[0.9, 0.95]] [2023-09-11 15:08:47,868] [INFO] [config.py:960:print] DeepSpeedEngine configuration: [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, 
"cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] amp_enabled .................. False [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] amp_params ................... False [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] bfloat16_enabled ............. False [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] checkpoint_parallel_write_pipeline False [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] checkpoint_tag_validation_enabled True [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] checkpoint_tag_validation_fail False [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fa5500e2370> [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] communication_data_type ...... None [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] curriculum_enabled_legacy .... False [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] curriculum_params_legacy ..... False [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] data_efficiency_config ....... 
{'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] data_efficiency_enabled ...... False [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] dataloader_drop_last ......... False [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] disable_allgather ............ False [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] dump_state ................... False [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 400, 'delayed_shift': 2, 'consecutive_hysteresis': False, 'min_scale': 1} [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] eigenvalue_enabled ........... False [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] eigenvalue_gas_boundary_resolution 1 [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] eigenvalue_layer_name ........ bert.encoder.layer [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] eigenvalue_layer_num ......... 0 [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] eigenvalue_max_iter .......... 100 [2023-09-11 15:08:47,869] [INFO] [config.py:964:print] eigenvalue_stability ......... 1e-06 [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] eigenvalue_tol ............... 0.01 [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] eigenvalue_verbose ........... False [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] elasticity_enabled ........... False [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] fp16_auto_cast ............... False [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] fp16_enabled ................. True [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] fp16_master_weights_and_gradients False [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] global_rank .................. 0 [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] grad_accum_dtype ............. None [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] gradient_accumulation_steps .. 1 [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] gradient_clipping ............ 0.1 [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] gradient_predivide_factor .... 1.0 [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] initial_dynamic_scale ........ 65536 [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] load_universal_checkpoint .... False [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] loss_scale ................... 0 [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] memory_breakdown ............. False [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] mics_hierarchial_params_gather False [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] mics_shard_size .............. -1 [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] monitor_config ............... 
tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] optimizer_legacy_fusion ...... False [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] optimizer_name ............... adam [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] optimizer_params ............. {'lr': 0.0001, 'betas': [0.9, 0.95], 'eps': 1e-08, 'weight_decay': 0.01} [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] pld_enabled .................. False [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] pld_params ................... False [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] prescale_gradients ........... False [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] scheduler_name ............... None [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] scheduler_params ............. None [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] sparse_attention ............. None [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] sparse_gradients_enabled ..... False [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] steps_per_print .............. 10 [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] train_batch_size ............. 4 [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] train_micro_batch_size_per_gpu 4 [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] use_node_local_storage ....... False [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] wall_clock_breakdown ......... False [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] world_size ................... 1 [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] zero_allow_untested_optimizer True [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] zero_config .................. stage=1 contiguous_gradients=False reduce_scatter=True reduce_bucket_size=40000000 allgather_partitions=True allgather_bucket_size=100000000 overlap_comm=True load_from_fp32_weights=False elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2023-09-11 15:08:47,870] [INFO] [config.py:964:print] zero_enabled ................. True [2023-09-11 15:08:47,871] [INFO] [config.py:964:print] zero_force_ds_cpu_optimizer .. 
True [2023-09-11 15:08:47,871] [INFO] [config.py:964:print] zero_optimization_stage ...... 1 [2023-09-11 15:08:47,871] [INFO] [config.py:950:print_user_config] json = { "train_micro_batch_size_per_gpu": 4, "gradient_accumulation_steps": 1, "gradient_clipping": 0.1, "zero_optimization": { "stage": 1, "cpu_offload": false, "contiguous_gradients": false, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 4.000000e+07, "allgather_bucket_size": 1.000000e+08, "load_from_fp32_weights": false }, "zero_allow_untested_optimizer": true, "fp16": { "enabled": true, "loss_scale": 0, "loss_scale_window": 400, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": false }, "optimizer": { "type": "Adam", "params": { "lr": 0.0001, "betas": [0.9, 0.95], "eps": 1e-08, "weight_decay": 0.01 } }, "activation_checkpointing": { "partition_activations": false, "contiguous_memory_optimization": false }, "wall_clock_breakdown": false } [2023-09-11 15:08:47,871] [INFO] [RANK 0] learning rate decaying style cosine, ratio 10.0 [2023-09-11 15:08:47,871] [INFO] [RANK 0] Finetuning Model... [2023-09-11 15:08:47,871] [INFO] [RANK 0] arguments: [2023-09-11 15:08:47,871] [INFO] [RANK 0] model_class .................. VisualGLMModel [2023-09-11 15:08:47,871] [INFO] [RANK 0] tokenizer_type ............... THUDM/chatglm-6b [2023-09-11 15:08:47,871] [INFO] [RANK 0] num_layers ................... 28 [2023-09-11 15:08:47,871] [INFO] [RANK 0] hidden_size .................. 4096 [2023-09-11 15:08:47,871] [INFO] [RANK 0] num_attention_heads .......... 32 [2023-09-11 15:08:47,871] [INFO] [RANK 0] vocab_size ................... 130528 [2023-09-11 15:08:47,871] [INFO] [RANK 0] layernorm_order .............. post [2023-09-11 15:08:47,871] [INFO] [RANK 0] model_parallel_size .......... 1 [2023-09-11 15:08:47,871] [INFO] [RANK 0] max_sequence_length .......... 2048 [2023-09-11 15:08:47,871] [INFO] [RANK 0] image_length ................. 32 [2023-09-11 15:08:47,871] [INFO] [RANK 0] eva_args ..................... {'num_layers': 39, 'hidden_size': 1408, 'num_attention_heads': 16, 'vocab_size': 1, 'layernorm_order': 'pre', 'model_parallel_size': 1, 'max_sequence_length': 257, 'inner_hidden_size': 6144, 'use_final_layernorm': False, 'layernorm_epsilon': 1e-06, 'image_size': [224, 224], 'pre_len': 1, 'post_len': 0, 'in_channels': 3, 'num_classes': 0, 'patch_size': 14} [2023-09-11 15:08:47,871] [INFO] [RANK 0] qformer_args ................. {'num_layers': 12, 'hidden_size': 768, 'num_attention_heads': 12, 'vocab_size': 32, 'layernorm_order': 'post', 'model_parallel_size': 1, 'max_sequence_length': 0, 'is_decoder': [True, False, True, False, True, False, True, False, True, False, True, False], 'cross_attn_hidden_size': 1408, 'layernorm_epsilon': 1e-12} [2023-09-11 15:08:47,871] [INFO] [RANK 0] bos_token_id ................. 130004 [2023-09-11 15:08:47,871] [INFO] [RANK 0] mask_token_id ................ 130000 [2023-09-11 15:08:47,871] [INFO] [RANK 0] gmask_token_id ............... 130001 [2023-09-11 15:08:47,871] [INFO] [RANK 0] pad_token_id ................. 3 [2023-09-11 15:08:47,871] [INFO] [RANK 0] image_size ................... [224, 224] [2023-09-11 15:08:47,871] [INFO] [RANK 0] pre_len ...................... 1 [2023-09-11 15:08:47,871] [INFO] [RANK 0] post_len ..................... 0 [2023-09-11 15:08:47,871] [INFO] [RANK 0] in_channels .................. 3 [2023-09-11 15:08:47,871] [INFO] [RANK 0] patch_size ................... 14 [2023-09-11 15:08:47,871] [INFO] [RANK 0] inner_hidden_size ............ 
None [2023-09-11 15:08:47,871] [INFO] [RANK 0] hidden_size_per_attention_head None [2023-09-11 15:08:47,871] [INFO] [RANK 0] skip_init .................... True [2023-09-11 15:08:47,872] [INFO] [RANK 0] use_gpu_initialization ....... False [2023-09-11 15:08:47,872] [INFO] [RANK 0] num_multi_query_heads ........ 0 [2023-09-11 15:08:47,872] [INFO] [RANK 0] layernorm_epsilon ............ 1e-05 [2023-09-11 15:08:47,872] [INFO] [RANK 0] hidden_dropout ............... 0.1 [2023-09-11 15:08:47,872] [INFO] [RANK 0] attention_dropout ............ 0.1 [2023-09-11 15:08:47,872] [INFO] [RANK 0] make_vocab_size_divisible_by . 128 [2023-09-11 15:08:47,872] [INFO] [RANK 0] experiment_name .............. finetune-visualglm-6b-09-11-14-57 [2023-09-11 15:08:47,872] [INFO] [RANK 0] train_iters .................. 300 [2023-09-11 15:08:47,872] [INFO] [RANK 0] batch_size ................... 4 [2023-09-11 15:08:47,872] [INFO] [RANK 0] lr ........................... 0.0001 [2023-09-11 15:08:47,872] [INFO] [RANK 0] mode ......................... finetune [2023-09-11 15:08:47,872] [INFO] [RANK 0] seed ......................... 1234 [2023-09-11 15:08:47,872] [INFO] [RANK 0] zero_stage ................... 1 [2023-09-11 15:08:47,872] [INFO] [RANK 0] checkpoint_activations ....... True [2023-09-11 15:08:47,872] [INFO] [RANK 0] checkpoint_num_layers ........ 1 [2023-09-11 15:08:47,872] [INFO] [RANK 0] fp16 ......................... True [2023-09-11 15:08:47,872] [INFO] [RANK 0] bf16 ......................... False [2023-09-11 15:08:47,872] [INFO] [RANK 0] gradient_accumulation_steps .. 1 [2023-09-11 15:08:47,872] [INFO] [RANK 0] epochs ....................... None [2023-09-11 15:08:47,872] [INFO] [RANK 0] log_interval ................. 50 [2023-09-11 15:08:47,872] [INFO] [RANK 0] summary_dir .................. [2023-09-11 15:08:47,872] [INFO] [RANK 0] save_args .................... False [2023-09-11 15:08:47,872] [INFO] [RANK 0] lr_decay_iters ............... None [2023-09-11 15:08:47,872] [INFO] [RANK 0] lr_decay_style ............... cosine [2023-09-11 15:08:47,872] [INFO] [RANK 0] lr_decay_ratio ............... 0.1 [2023-09-11 15:08:47,872] [INFO] [RANK 0] warmup ....................... 0.02 [2023-09-11 15:08:47,872] [INFO] [RANK 0] weight_decay ................. 0.01 [2023-09-11 15:08:47,872] [INFO] [RANK 0] save ......................... ./checkpoints/finetune-visualglm-6b-09-11-14-57 [2023-09-11 15:08:47,872] [INFO] [RANK 0] load ......................... None [2023-09-11 15:08:47,872] [INFO] [RANK 0] save_interval ................ 300 [2023-09-11 15:08:47,872] [INFO] [RANK 0] no_save_rng .................. False [2023-09-11 15:08:47,872] [INFO] [RANK 0] no_load_rng .................. False [2023-09-11 15:08:47,872] [INFO] [RANK 0] resume_dataloader ............ True [2023-09-11 15:08:47,872] [INFO] [RANK 0] distributed_backend .......... nccl [2023-09-11 15:08:47,872] [INFO] [RANK 0] local_rank ................... 0 [2023-09-11 15:08:47,872] [INFO] [RANK 0] exit_interval ................ None [2023-09-11 15:08:47,872] [INFO] [RANK 0] eval_batch_size .............. 8 [2023-09-11 15:08:47,872] [INFO] [RANK 0] eval_iters ................... 10 [2023-09-11 15:08:47,872] [INFO] [RANK 0] eval_interval ................ 10000 [2023-09-11 15:08:47,872] [INFO] [RANK 0] strict_eval .................. False [2023-09-11 15:08:47,872] [INFO] [RANK 0] train_data ................... ['./fewshot-data/dataset-verify.json'] [2023-09-11 15:08:47,872] [INFO] [RANK 0] train_data_weights ........... 
None [2023-09-11 15:08:47,873] [INFO] [RANK 0] iterable_dataset ............. False [2023-09-11 15:08:47,873] [INFO] [RANK 0] valid_data ................... ['./fewshot-data/dataset-verify.json'] [2023-09-11 15:08:47,873] [INFO] [RANK 0] test_data .................... None [2023-09-11 15:08:47,873] [INFO] [RANK 0] split ........................ 1 [2023-09-11 15:08:47,873] [INFO] [RANK 0] num_workers .................. 1 [2023-09-11 15:08:47,873] [INFO] [RANK 0] block_size ................... 10000 [2023-09-11 15:08:47,873] [INFO] [RANK 0] prefetch_factor .............. 4 [2023-09-11 15:08:47,873] [INFO] [RANK 0] temperature .................. 1.0 [2023-09-11 15:08:47,873] [INFO] [RANK 0] top_p ........................ 0.0 [2023-09-11 15:08:47,873] [INFO] [RANK 0] top_k ........................ 0 [2023-09-11 15:08:47,873] [INFO] [RANK 0] num_beams .................... 1 [2023-09-11 15:08:47,873] [INFO] [RANK 0] length_penalty ............... 0.0 [2023-09-11 15:08:47,873] [INFO] [RANK 0] no_repeat_ngram_size ......... 0 [2023-09-11 15:08:47,873] [INFO] [RANK 0] min_tgt_length ............... 0 [2023-09-11 15:08:47,873] [INFO] [RANK 0] out_seq_length ............... 256 [2023-09-11 15:08:47,873] [INFO] [RANK 0] input_source ................. interactive [2023-09-11 15:08:47,873] [INFO] [RANK 0] output_path .................. ./samples [2023-09-11 15:08:47,873] [INFO] [RANK 0] with_id ...................... False [2023-09-11 15:08:47,873] [INFO] [RANK 0] max_inference_batch_size ..... 12 [2023-09-11 15:08:47,873] [INFO] [RANK 0] device ....................... cpu [2023-09-11 15:08:47,873] [INFO] [RANK 0] deepspeed .................... True [2023-09-11 15:08:47,873] [INFO] [RANK 0] deepspeed_config ............. {'train_micro_batch_size_per_gpu': 4, 'gradient_accumulation_steps': 1, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 1, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 40000000.0, 'allgather_bucket_size': 100000000.0, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'fp16': {'enabled': True, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1}, 'bf16': {'enabled': False}, 'optimizer': {'type': 'Adam', 'params': {'lr': 0.0001, 'betas': [0.9, 0.95], 'eps': 1e-08, 'weight_decay': 0.01}}, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False} [2023-09-11 15:08:47,873] [INFO] [RANK 0] deepscale .................... False [2023-09-11 15:08:47,873] [INFO] [RANK 0] deepscale_config ............. None [2023-09-11 15:08:47,873] [INFO] [RANK 0] deepspeed_mpi ................ False [2023-09-11 15:08:47,873] [INFO] [RANK 0] cuda ......................... True [2023-09-11 15:08:47,873] [INFO] [RANK 0] rank ......................... 0 [2023-09-11 15:08:47,873] [INFO] [RANK 0] world_size ................... 1 [2023-09-11 15:08:47,873] [INFO] [RANK 0] deepspeed_activation_checkpointing True [2023-09-11 15:08:47,873] [INFO] [RANK 0] master_ip .................... 127.0.0.1 [2023-09-11 15:08:47,873] [INFO] [RANK 0] master_port .................. 16666 [2023-09-11 15:08:47,873] [INFO] [RANK 0] max_source_length ............ 64 [2023-09-11 15:08:47,873] [INFO] [RANK 0] max_target_length ............ 256 [2023-09-11 15:08:47,873] [INFO] [RANK 0] ignore_pad_token_for_loss .... True [2023-09-11 15:08:47,873] [INFO] [RANK 0] source_prefix ................ 
[2023-09-11 15:08:47,873] [INFO] [RANK 0] pre_seq_len .................. 4 [2023-09-11 15:08:47,873] [INFO] [RANK 0] lora_rank .................... 10 [2023-09-11 15:08:47,873] [INFO] [RANK 0] use_ptuning .................. False [2023-09-11 15:08:47,873] [INFO] [RANK 0] use_lora ..................... True [2023-09-11 15:08:47,873] [INFO] [RANK 0] use_qlora .................... False [2023-09-11 15:08:47,873] [INFO] [RANK 0] layer_range .................. [0, 14] [2023-09-11 15:08:47,874] [INFO] [RANK 0] do_train ..................... True [2023-09-11 15:08:47,874] [INFO] [RANK 0] val_last_shape ............... [] [2023-09-11 15:08:47,874] [INFO] [RANK 0] val_drop_number .............. 0 [2023-09-11 15:08:47,874] [INFO] [RANK 0] do_valid ..................... True [2023-09-11 15:08:47,874] [INFO] [RANK 0] do_test ...................... False [2023-09-11 15:08:47,874] [INFO] [RANK 0] iteration .................... 0 71f453ed31a9:6101:6475 [0] NCCL INFO Using network Socket 71f453ed31a9:6101:6475 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 00/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 01/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 02/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 03/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 04/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 05/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 06/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 07/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 08/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 09/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 10/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 11/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 12/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 13/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 14/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 15/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 16/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 17/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 18/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 19/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 20/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 21/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 22/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 23/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 24/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 25/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 26/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 27/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 28/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 29/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 30/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Channel 31/32 : 0 71f453ed31a9:6101:6475 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1 
71f453ed31a9:6101:6475 [0] NCCL INFO Connected all rings 71f453ed31a9:6101:6475 [0] NCCL INFO Connected all trees 71f453ed31a9:6101:6475 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer 71f453ed31a9:6101:6475 [0] NCCL INFO comm 0xacd76e00 rank 0 nranks 1 cudaDev 0 busId 31000 - Init COMPLETE [2023-09-11 15:08:54,257] [INFO] [checkpointing.py:529:forward] Activation Checkpointing Information [2023-09-11 15:08:54,257] [INFO] [checkpointing.py:530:forward] ----Partition Activations False, CPU CHECKPOINTING False [2023-09-11 15:08:54,257] [INFO] [checkpointing.py:531:forward] ----contiguous Memory Checkpointing False with 6 total layers [2023-09-11 15:08:54,257] [INFO] [checkpointing.py:533:forward] ----Synchronization False [2023-09-11 15:08:54,257] [INFO] [checkpointing.py:534:forward] ----Profiling time in checkpointing False [2023-09-11 15:08:57,587] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1 [2023-09-11 15:08:58,641] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768 [2023-09-11 15:09:07,080] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=2, lr=[5e-06], mom=[[0.9, 0.95]] [2023-09-11 15:09:07,081] [INFO] [timer.py:260:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=3.79834182111924, CurrSamplesPerSec=3.7656125466430996, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:09:17,758] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=2, lr=[5e-06], mom=[[0.9, 0.95]] [2023-09-11 15:09:17,759] [INFO] [timer.py:260:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=3.7727069344075694, CurrSamplesPerSec=3.72878645199102, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:09:28,565] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=2, lr=[5e-06], mom=[[0.9, 0.95]] [2023-09-11 15:09:28,565] [INFO] [timer.py:260:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=3.749089152322475, CurrSamplesPerSec=3.6959479256299463, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:09:39,429] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=2, lr=[5e-06], mom=[[0.9, 0.95]] [2023-09-11 15:09:39,429] [INFO] [timer.py:260:stop] epoch=0/micro_step=40/global_step=40, RunningAvgSamplesPerSec=3.7333655019933047, CurrSamplesPerSec=3.6825283105672297, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:09:50,294] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=2, lr=[5e-06], mom=[[0.9, 0.95]] [2023-09-11 15:09:50,295] [INFO] [timer.py:260:stop] epoch=0/micro_step=50/global_step=50, RunningAvgSamplesPerSec=3.7241199319362233, CurrSamplesPerSec=3.683957129892456, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:09:50,295] [INFO] [RANK 0] iteration 50/ 300 | elapsed time per iteration (ms): 1246.1 | learning rate 5.000E-06 | total loss 5.840703E+00 | loss 5.840703E+00 | loss scale 32768.0 |speed 192.61 samples/(minGPU) [2023-09-11 15:09:50,297] [INFO] [RANK 0] after 50 iterations memory (MB) | allocated: 14931.81298828125 | max allocated: 17771.498046875 | cached: 19290.0 | max cached: 19290.0 [2023-09-11 15:09:50,297] [INFO] [RANK 0] time (ms) | forward: 556.19 | backward: 686.45 | allreduce: 0.00 | optimizer: 2.67 | batch generator: 8.26 | data loader: 5.40 [2023-09-11 15:10:01,139] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=2, lr=[5e-06], 
mom=[[0.9, 0.95]] [2023-09-11 15:10:01,139] [INFO] [timer.py:260:stop] epoch=0/micro_step=60/global_step=60, RunningAvgSamplesPerSec=3.719093807109329, CurrSamplesPerSec=3.693080905980746, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:10:11,976] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=2, lr=[5e-06], mom=[[0.9, 0.95]] [2023-09-11 15:10:11,976] [INFO] [timer.py:260:stop] epoch=0/micro_step=70/global_step=70, RunningAvgSamplesPerSec=3.715815161951777, CurrSamplesPerSec=3.7077261237401475, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:10:22,784] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=2, lr=[5e-06], mom=[[0.9, 0.95]] [2023-09-11 15:10:22,785] [INFO] [timer.py:260:stop] epoch=0/micro_step=80/global_step=80, RunningAvgSamplesPerSec=3.7146515371242694, CurrSamplesPerSec=3.708240777701597, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:10:33,575] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=2, lr=[5e-06], mom=[[0.9, 0.95]] [2023-09-11 15:10:33,575] [INFO] [timer.py:260:stop] epoch=0/micro_step=90/global_step=90, RunningAvgSamplesPerSec=3.714437721805471, CurrSamplesPerSec=3.7065211796116344, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:10:44,341] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=2, lr=[5e-06], mom=[[0.9, 0.95]] [2023-09-11 15:10:44,342] [INFO] [timer.py:260:stop] epoch=0/micro_step=100/global_step=100, RunningAvgSamplesPerSec=3.7151055645323754, CurrSamplesPerSec=3.722630024110137, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:10:44,342] [INFO] [RANK 0] iteration 100/ 300 | elapsed time per iteration (ms): 1080.9 | learning rate 5.000E-06 | total loss 5.623437E+00 | loss 5.623437E+00 | loss scale 32768.0 |speed 222.03 samples/(minGPU) [2023-09-11 15:10:44,343] [INFO] [RANK 0] time (ms) | forward: 407.80 | backward: 669.69 | allreduce: 0.00 | optimizer: 2.69 | batch generator: 1.32 | data loader: 0.12 [2023-09-11 15:10:55,110] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=2, lr=[7.667891533457719e-05], mom=[[0.9, 0.95]] [2023-09-11 15:10:55,111] [INFO] [timer.py:260:stop] epoch=0/micro_step=110/global_step=110, RunningAvgSamplesPerSec=3.7156135199372278, CurrSamplesPerSec=3.715706181055347, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:11:05,843] [INFO] [logging.py:96:log_dist] [Rank 0] step=120, skipped=2, lr=[7.243820139034464e-05], mom=[[0.9, 0.95]] [2023-09-11 15:11:05,843] [INFO] [timer.py:260:stop] epoch=0/micro_step=120/global_step=120, RunningAvgSamplesPerSec=3.717065516728891, CurrSamplesPerSec=3.741957471105421, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:11:16,552] [INFO] [logging.py:96:log_dist] [Rank 0] step=130, skipped=2, lr=[6.800643086250122e-05], mom=[[0.9, 0.95]] [2023-09-11 15:11:16,553] [INFO] [timer.py:260:stop] epoch=0/micro_step=130/global_step=130, RunningAvgSamplesPerSec=3.7189077364106304, CurrSamplesPerSec=3.7400955732926238, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:11:27,262] [INFO] [logging.py:96:log_dist] [Rank 0] step=140, skipped=2, lr=[6.343215915635762e-05], mom=[[0.9, 0.95]] [2023-09-11 15:11:27,262] [INFO] [timer.py:260:stop] epoch=0/micro_step=140/global_step=140, RunningAvgSamplesPerSec=3.720500391512484, CurrSamplesPerSec=3.7379098764703196, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:11:37,972] [INFO] [logging.py:96:log_dist] [Rank 0] step=150, skipped=2, lr=[5.876550294995422e-05], mom=[[0.9, 0.95]] [2023-09-11 15:11:37,973] [INFO] 
[timer.py:260:stop] epoch=0/micro_step=150/global_step=150, RunningAvgSamplesPerSec=3.7218470016654877, CurrSamplesPerSec=3.7406559497001184, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:11:37,973] [INFO] [RANK 0] iteration 150/ 300 | elapsed time per iteration (ms): 1072.6 | learning rate 5.830E-05 | total loss 5.725078E+00 | loss 5.725078E+00 | loss scale 32768.0 |speed 223.75 samples/(minGPU) [2023-09-11 15:11:37,974] [INFO] [RANK 0] time (ms) | forward: 404.25 | backward: 664.93 | allreduce: 0.00 | optimizer: 2.68 | batch generator: 1.30 | data loader: 0.12 [2023-09-11 15:11:48,664] [INFO] [logging.py:96:log_dist] [Rank 0] step=160, skipped=2, lr=[5.4057591105248944e-05], mom=[[0.9, 0.95]] [2023-09-11 15:11:48,664] [INFO] [timer.py:260:stop] epoch=0/micro_step=160/global_step=160, RunningAvgSamplesPerSec=3.723471275234616, CurrSamplesPerSec=3.7431104072836328, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:11:59,345] [INFO] [logging.py:96:log_dist] [Rank 0] step=170, skipped=2, lr=[4.936000448960631e-05], mom=[[0.9, 0.95]] [2023-09-11 15:11:59,346] [INFO] [timer.py:260:stop] epoch=0/micro_step=170/global_step=170, RunningAvgSamplesPerSec=3.7250850527708503, CurrSamplesPerSec=3.748888274646495, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:12:10,005] [INFO] [logging.py:96:log_dist] [Rank 0] step=180, skipped=2, lr=[4.4724210845020494e-05], mom=[[0.9, 0.95]] [2023-09-11 15:12:10,006] [INFO] [timer.py:260:stop] epoch=0/micro_step=180/global_step=180, RunningAvgSamplesPerSec=3.726933918666199, CurrSamplesPerSec=3.7549411520092604, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:12:20,659] [INFO] [logging.py:96:log_dist] [Rank 0] step=190, skipped=2, lr=[4.020100089676376e-05], mom=[[0.9, 0.95]] [2023-09-11 15:12:20,660] [INFO] [timer.py:260:stop] epoch=0/micro_step=190/global_step=190, RunningAvgSamplesPerSec=3.728695161247372, CurrSamplesPerSec=3.755563151498546, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:12:31,302] [INFO] [logging.py:96:log_dist] [Rank 0] step=200, skipped=2, lr=[3.583993187957173e-05], mom=[[0.9, 0.95]] [2023-09-11 15:12:31,303] [INFO] [timer.py:260:stop] epoch=0/micro_step=200/global_step=200, RunningAvgSamplesPerSec=3.730479955904745, CurrSamplesPerSec=3.7611173134261437, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:12:31,303] [INFO] [RANK 0] iteration 200/ 300 | elapsed time per iteration (ms): 1066.6 | learning rate 3.541E-05 | total loss 5.380703E+00 | loss 5.380703E+00 | loss scale 32768.0 |speed 225.02 samples/(minGPU) [2023-09-11 15:12:31,304] [INFO] [RANK 0] time (ms) | forward: 402.37 | backward: 660.79 | allreduce: 0.00 | optimizer: 2.68 | batch generator: 1.30 | data loader: 0.12 [2023-09-11 15:12:41,936] [INFO] [logging.py:96:log_dist] [Rank 0] step=210, skipped=2, lr=[3.168878457820915e-05], mom=[[0.9, 0.95]] [2023-09-11 15:12:41,937] [INFO] [timer.py:260:stop] epoch=0/micro_step=210/global_step=210, RunningAvgSamplesPerSec=3.73226334126383, CurrSamplesPerSec=3.774635326145037, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:12:52,533] [INFO] [logging.py:96:log_dist] [Rank 0] step=220, skipped=2, lr=[2.7793039831193133e-05], mom=[[0.9, 0.95]] [2023-09-11 15:12:52,533] [INFO] [timer.py:260:stop] epoch=0/micro_step=220/global_step=220, RunningAvgSamplesPerSec=3.7344633748176617, CurrSamplesPerSec=3.7779715366600612, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:13:03,129] [INFO] [logging.py:96:log_dist] [Rank 0] step=230, skipped=2, 
lr=[2.419538023320901e-05], mom=[[0.9, 0.95]] [2023-09-11 15:13:03,129] [INFO] [timer.py:260:stop] epoch=0/micro_step=230/global_step=230, RunningAvgSamplesPerSec=3.7364764349207826, CurrSamplesPerSec=3.7810153868690453, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:13:13,725] [INFO] [logging.py:96:log_dist] [Rank 0] step=240, skipped=2, lr=[2.093522249567097e-05], mom=[[0.9, 0.95]] [2023-09-11 15:13:13,726] [INFO] [timer.py:260:stop] epoch=0/micro_step=240/global_step=240, RunningAvgSamplesPerSec=3.738326810011324, CurrSamplesPerSec=3.780000793075728, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:13:24,315] [INFO] [logging.py:96:log_dist] [Rank 0] step=250, skipped=2, lr=[1.804828558898332e-05], mom=[[0.9, 0.95]] [2023-09-11 15:13:24,315] [INFO] [timer.py:260:stop] epoch=0/micro_step=250/global_step=250, RunningAvgSamplesPerSec=3.740120048593816, CurrSamplesPerSec=3.788265402317096, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:13:24,316] [INFO] [RANK 0] iteration 250/ 300 | elapsed time per iteration (ms): 1060.3 | learning rate 1.778E-05 | total loss 4.975234E+00 | loss 4.975234E+00 | loss scale 32768.0 |speed 226.36 samples/(minGPU) [2023-09-11 15:13:24,317] [INFO] [RANK 0] time (ms) | forward: 396.91 | backward: 659.91 | allreduce: 0.00 | optimizer: 2.68 | batch generator: 1.30 | data loader: 0.12 [2023-09-11 15:13:34,913] [INFO] [logging.py:96:log_dist] [Rank 0] step=260, skipped=2, lr=[1.556619939802615e-05], mom=[[0.9, 0.95]] [2023-09-11 15:13:34,914] [INFO] [timer.py:260:stop] epoch=0/micro_step=260/global_step=260, RunningAvgSamplesPerSec=3.7416724772945047, CurrSamplesPerSec=3.779847501326996, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:13:45,509] [INFO] [logging.py:96:log_dist] [Rank 0] step=270, skipped=2, lr=[1.3516158178517482e-05], mom=[[0.9, 0.95]] [2023-09-11 15:13:45,509] [INFO] [timer.py:260:stop] epoch=0/micro_step=270/global_step=270, RunningAvgSamplesPerSec=3.7431273152873965, CurrSamplesPerSec=3.770404068756794, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:13:56,088] [INFO] [logging.py:96:log_dist] [Rank 0] step=280, skipped=2, lr=[1.1920622611056975e-05], mom=[[0.9, 0.95]] [2023-09-11 15:13:56,088] [INFO] [timer.py:260:stop] epoch=0/micro_step=280/global_step=280, RunningAvgSamplesPerSec=3.744696499549114, CurrSamplesPerSec=3.7911613770662207, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:14:06,647] [INFO] [logging.py:96:log_dist] [Rank 0] step=290, skipped=2, lr=[1.0797073717209012e-05], mom=[[0.9, 0.95]] [2023-09-11 15:14:06,648] [INFO] [timer.py:260:stop] epoch=0/micro_step=290/global_step=290, RunningAvgSamplesPerSec=3.746389873887378, CurrSamplesPerSec=3.8006454920330643, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:14:17,218] [INFO] [logging.py:96:log_dist] [Rank 0] step=300, skipped=2, lr=[1.0157821333772305e-05], mom=[[0.9, 0.95]] [2023-09-11 15:14:17,218] [INFO] [timer.py:260:stop] epoch=0/micro_step=300/global_step=300, RunningAvgSamplesPerSec=3.747843231488021, CurrSamplesPerSec=3.7641205427425337, MemAllocated=14.6GB, MaxMemAllocated=17.35GB [2023-09-11 15:14:17,219] [INFO] [RANK 0] iteration 300/ 300 | elapsed time per iteration (ms): 1058.1 | learning rate 1.012E-05 | total loss 4.691602E+00 | loss 4.691602E+00 | loss scale 32768.0 |speed 226.83 samples/(minGPU) [2023-09-11 15:14:17,219] [INFO] [RANK 0] time (ms) | forward: 398.83 | backward: 655.78 | allreduce: 0.00 | optimizer: 2.68 | batch generator: 1.29 | data loader: 0.12 [2023-09-11 
15:14:17,219] [INFO] [RANK 0] Saving Model... /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details. warnings.warn( [2023-09-11 15:14:17,244] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: ./checkpoints/finetune-visualglm-6b-09-11-14-57/300/mp_rank_00_model_states.pt [2023-09-11 15:14:17,244] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving ./checkpoints/finetune-visualglm-6b-09-11-14-57/300/mp_rank_00_model_states.pt... [2023-09-11 15:14:31,573] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved ./checkpoints/finetune-visualglm-6b-09-11-14-57/300/mp_rank_00_model_states.pt. [2023-09-11 15:14:32,121] [INFO] [RANK 0] Saving Model... [2023-09-11 15:14:32,132] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: ./checkpoints/finetune-visualglm-6b-09-11-14-57/300/mp_rank_00_model_states.pt [2023-09-11 15:14:32,132] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving ./checkpoints/finetune-visualglm-6b-09-11-14-57/300/mp_rank_00_model_states.pt... [2023-09-11 15:16:30,648] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved ./checkpoints/finetune-visualglm-6b-09-11-14-57/300/mp_rank_00_model_states.pt. 71f453ed31a9:6101:6321 [0] NCCL INFO [Service thread] Connection closed by localRank 0 71f453ed31a9:6101:6101 [0] NCCL INFO comm 0x137af9cf0 rank 0 nranks 1 cudaDev 0 busId 31000 - Abort COMPLETE 71f453ed31a9:6101:6476 [0] NCCL INFO [Service thread] Connection closed by localRank 0 71f453ed31a9:6101:6101 [0] NCCL INFO comm 0xacd76e00 rank 0 nranks 1 cudaDev 0 busId 31000 - Abort COMPLETE [2023-09-11 15:16:34,616] [INFO] [launch.py:347:main] Process 6101 exits successfully.
```
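One note on the two OVERFLOW! entries near the start of the run: they come from DeepSpeed's dynamic fp16 loss scaler, configured above with init_scale=65536, scale_window=400, hysteresis (delayed_shift) 2, and min_scale=1. With hysteresis 2, the first overflow only spends a hysteresis credit; the second halves the scale to 32768, which is exactly the sequence in the log. A minimal sketch of that update rule (hypothetical standalone code, not the DeepSpeed source):

```python
# Sketch of DeepSpeed-style dynamic loss scaling with hysteresis.
# Defaults mirror the fp16 config printed in the log above.
class DynamicLossScaler:
    def __init__(self, init_scale=65536, scale_window=400,
                 hysteresis=2, min_scale=1, scale_factor=2.0):
        self.cur_scale = init_scale
        self.scale_window = scale_window
        self.hysteresis = hysteresis
        self.cur_hysteresis = hysteresis
        self.min_scale = min_scale
        self.scale_factor = scale_factor
        self.good_steps = 0  # consecutive steps without overflow

    def update(self, overflow: bool) -> None:
        if overflow:
            if self.cur_hysteresis > 1:
                # First overflow: spend a hysteresis credit, keep the scale.
                self.cur_hysteresis -= 1
            else:
                # Repeated overflow: halve the scale, but not below min_scale.
                self.cur_scale = max(self.cur_scale / self.scale_factor,
                                     self.min_scale)
            self.good_steps = 0  # the optimizer step is skipped either way
        else:
            self.good_steps += 1
            if self.good_steps % self.scale_window == 0:
                # A full window of clean steps: try a larger scale again.
                self.cur_scale *= self.scale_factor
                self.cur_hysteresis = self.hysteresis
```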
Your --train-iters 300 effectively cuts training off after 300 iterations over the training data.
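For context, --train-iters counts optimizer steps, not layers: each iteration consumes one global batch of batch_size × world_size samples. A minimal sketch of how you might pick a value for a target number of epochs; the dataset size and epoch count below are hypothetical placeholders, not numbers taken from this run:

```python
import math

# Hypothetical figures for illustration; substitute your own dataset size.
dataset_size = 1200   # samples in your train-data JSON (assumed)
epochs = 4            # desired passes over the data (assumed)
batch_size = 4        # --batch-size from the command line above
world_size = 1        # single A10 GPU, per this log

samples_per_iter = batch_size * world_size
train_iters = math.ceil(epochs * dataset_size / samples_per_iter)
print(f"--train-iters {train_iters}")  # 1200 with these placeholder numbers
```

Note that with --train-iters 300 and --save-interval 300, the run above saves exactly one checkpoint at the final step, which matches the log.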
Doesn't this parameter mean how many layers to train? What value would be appropriate?
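For reference, the startup log ("replacing layer 0 attention with lora", "replacing layer 14 attention with lora", and the args dump showing layer_range [0, 14]) indicates the layers are selected by --layer_range, which lists the transformer layer indices whose attention gets a LoRA adapter; --train-iters only sets how many optimizer steps to run. A rough sketch of that selection logic, not the actual SwissArmyTransformer implementation:

```python
# Illustrative sketch only, assuming layer_range is a plain list of
# layer indices; this is not the actual SwissArmyTransformer code.
layer_range = [0, 14]   # from "--layer_range 0 14" in the command above
num_layers = 28         # ChatGLM-6B transformer depth, per the args dump

for idx in range(num_layers):
    if idx in layer_range:
        # mirrors the "replacing layer N attention with lora" log lines
        print(f"replacing layer {idx} attention with lora")
```

With this setting only layers 0 and 14 carry trainable LoRA matrices, which is why the parameter list in the log names only transformer.layers.0.* and transformer.layers.14.* matrices.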