Open · 949491819 opened this issue 1 year ago
Is there an existing issue for this?
Current Behavior
(pytorchzdy) [work@gpu-2 chat_generate]$ sh dp_train_glm.sh
[2023-05-17 14:37:02,196] [INFO] [runner.py:299:parse_resource_filter] removing 0 from gpu-2
[2023-05-17 14:37:02,197] [INFO] [runner.py:299:parse_resource_filter] removing 1 from gpu-2
[2023-05-17 14:37:02,197] [INFO] [runner.py:299:parse_resource_filter] removing 2 from gpu-2
[2023-05-17 14:37:02,197] [INFO] [runner.py:299:parse_resource_filter] removing 3 from gpu-2
[2023-05-17 14:37:02,197] [INFO] [runner.py:299:parse_resource_filter] removing 4 from gpu-2
[2023-05-17 14:37:02,197] [INFO] [runner.py:299:parse_resource_filter] removing 5 from gpu-2
[2023-05-17 14:37:14,982] [INFO] [runner.py:454:main] Using IP address of 192.168.10.82 for node gpu-2
[2023-05-17 14:37:14,982] [INFO] [runner.py:550:main] cmd = /home/work/.conda/envs/pytorchzdy/bin/python -u -m deepspeed.launcher.launch --world_info=eyJncHUtMiI6IFs2LCA3XX0= --master_addr=192.168.10.82 --master_port=29500 --enable_each_rank_log=None dp_finetune.py --deepspeed ./config/deepspeed/ds_glm.json --model chatglm --model_path ./chatglm-6b --data_path data/instinwild_ch.json --max_datasets_size 10000 --max_len 128 --lora_rank 0 --pre_seq_len 128 --logging_steps 10 --num_train_epochs 1 --learning_rate 2e-2 --output_dir ./output/chatglm-6b --gradient_accumulation_steps 1 --per_device_train_batch_size 2 --per_device_eval_batch_size 1 --predict_with_generate --max_steps 3000 --save_steps 1000 --grad_checkpointing
[2023-05-17 14:37:18,127] [INFO] [launch.py:142:main] WORLD INFO DICT: {'gpu-2': [6, 7]}
[2023-05-17 14:37:18,127] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-05-17 14:37:18,127] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'gpu-2': [0, 1]})
[2023-05-17 14:37:18,127] [INFO] [launch.py:162:main] dist_world_size=2
[2023-05-17 14:37:18,127] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=6,7
[2023-05-17 14:37:25,621] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[INFO] [05/17/2023 14:37:28] [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 0
[INFO] [05/17/2023 14:37:28] [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 1
[INFO] [05/17/2023 14:37:28] [torch.distributed.distributed_c10d] Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[INFO] [05/17/2023 14:37:28] [torch.distributed.distributed_c10d] Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[INFO] [05/17/2023 14:37:28] [main] Loading model, config and tokenizer ...
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO] [05/17/2023 14:37:28] [main] Loading model, config and tokenizer ...
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO] [05/17/2023 14:37:28] [main] Use P-Tuning v2 to fine-tune model
Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[INFO] [05/17/2023 14:37:28] [main] Use P-Tuning v2 to fine-tune model
Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
[2023-05-17 14:37:41,891] [INFO] [partition_parameters.py:415:exit] finished initializing model with 6.74B parameters
Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]
/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
Loading checkpoint shards: 100%|██████████| 8/8 [00:21<00:00, 2.69s/it]
Loading checkpoint shards: 100%|██████████| 8/8 [00:21<00:00, 2.69s/it]
[INFO] [05/17/2023 14:38:03] [main] Loading dataset ...
[INFO] [05/17/2023 14:38:03] [dataset.data_loader] Building chaglm dataloaders
[INFO] [05/17/2023 14:38:03] [dataset.chat_dataset] Loading json data: data/instinwild_ch.json
[INFO] [05/17/2023 14:38:03] [main] Loading dataset ...
[INFO] [05/17/2023 14:38:03] [dataset.data_loader] Building chaglm dataloaders
[INFO] [05/17/2023 14:38:03] [dataset.chat_dataset] Loading json data: data/instinwild_ch.json
[WARNING] [05/17/2023 14:38:04] [datasets.builder] Found cached dataset json (/home/work/.cache/huggingface/datasets/json/default-a8d4b15460af874d/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|██████████| 1/1 [00:00<00:00, 259.85it/s]
[INFO] [05/17/2023 14:38:04] [dataset.chat_dataset] Loaded 51504 examples.
[INFO] [05/17/2023 14:38:04] [dataset.chat_dataset] Limiting dataset to 10000 examples.
[INFO] [05/17/2023 14:38:04] [dataset.chat_dataset] Formatting ChatGLM inputs ...
[WARNING] [05/17/2023 14:38:04] [datasets.builder] Found cached dataset json (/home/work/.cache/huggingface/datasets/json/default-a8d4b15460af874d/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e)
100%|██████████| 1/1 [00:00<00:00, 266.51it/s]
[INFO] [05/17/2023 14:38:04] [dataset.chat_dataset] Loaded 51504 examples.
[INFO] [05/17/2023 14:38:04] [dataset.chat_dataset] Limiting dataset to 10000 examples.
[INFO] [05/17/2023 14:38:04] [dataset.chat_dataset] Formatting ChatGLM inputs ...
[INFO] [05/17/2023 14:38:06] [dataset.chat_dataset] Tokenizing inputs ...
Dataset:   0%|          | 0/10000 [00:00<?, ?it/s]
[INFO] [05/17/2023 14:38:07] [dataset.chat_dataset] Tokenizing inputs ...
Dataset: 100%|██████████| 10000/10000 [00:05<00:00, 1812.93it/s]
[input_ids]: [5, 64286, 12, 64157, 68896, 64185, 66731, 79046, 64230, 69551, 63823, 4, 67342, 12, 130001, 130004, 5, 64276, 67446, 66731, 79046, 64230, 69551, 78511, 66136, 71738, 6, 83710, 63878, 63979, 83428, 63871, 71738, 6, 64651, 87699, 66265, 64748, 6, 66544, 66351, 64950, 63824, 81883, 63826, 71821, 84329, 63825, 70797, 63823, 66018, 6, 63878, 74017, 64351, 97701, 63944, 80799, 64288, 64786, 63944, 66057, 64258, 6, 64529, 91081, 6, 64024, 71377, 63835, 6, 64542, 77280, 6, 64287, 66715, 63841, 64987, 65878, 86089, 6, 63899, 80732, 66372, 64700, 6, 97827, 69514, 74470, 63823, 130005, 130005, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
[inputs] : 问:请讲解如何缓解上班族病的症状。 答: 一种有效的缓解上班族病的症状方法是出去散步,每天晚上可以花几个小时去散步,减少坐姿固定的时间,放松肩痛、腰痛和背部综合症的发作。另外,可以试着利用午休时间或者其他空余时间锻炼一下,比如慢跑,打太极拳等,帮助舒缓,运动释放时也可以练习深呼吸,这能帮助消除压力,更有利于解除病症。
[label_ids]: [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 130004, 5, 64276, 67446, 66731, 79046, 64230, 69551, 78511, 66136, 71738, 6, 83710, 63878, 63979, 83428, 63871, 71738, 6, 64651, 87699, 66265, 64748, 6, 66544, 66351, 64950, 63824, 81883, 63826, 71821, 84329, 63825, 70797, 63823, 66018, 6, 63878, 74017, 64351, 97701, 63944, 80799, 64288, 64786, 63944, 66057, 64258, 6, 64529, 91081, 6, 64024, 71377, 63835, 6, 64542, 77280, 6, 64287, 66715, 63841, 64987, 65878, 86089, 6, 63899, 80732, 66372, 64700, 6, 97827, 69514, 74470, 63823, 130005, 130005, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]
[labels] :
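For context on the `[label_ids]` dump above: the prompt tokens and trailing padding are replaced with -100 (the index that cross-entropy loss ignores by default in PyTorch), so only the answer span is supervised. A minimal sketch of that masking, using a hypothetical helper (not this repo's actual code) and a shortened toy sequence:

```python
# Sketch of the label masking visible in [label_ids] above:
# everything outside the answer span becomes -100, so the loss
# skips prompt and padding positions.
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_labels(input_ids, answer_start, answer_end):
    """Copy input_ids, replacing positions outside [answer_start, answer_end) with -100."""
    return [
        tok if answer_start <= i < answer_end else IGNORE_INDEX
        for i, tok in enumerate(input_ids)
    ]

# Toy example: 4 prompt tokens, a 3-token answer, 2 padding tokens (id 3).
ids = [5, 64286, 12, 130001, 130004, 64276, 130005, 3, 3]
labels = mask_labels(ids, answer_start=4, answer_end=7)
# -> [-100, -100, -100, -100, 130004, 64276, 130005, -100, -100]
```

In the real dump, the first -100 run has length 15, which matches the position of the first kept token (130004) in `[input_ids]`.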
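Also worth noting: the opaque `--world_info` value in the launcher command near the top of the log is just base64-encoded JSON mapping each node to its GPU ids; decoding it reproduces the `WORLD INFO DICT` the launcher prints a moment later. A quick check:

```python
# Decode the --world_info argument from the deepspeed.launcher.launch
# command line; it is base64-encoded JSON of {node: [gpu ids]}.
import base64
import json

world_info = "eyJncHUtMiI6IFs2LCA3XX0="
decoded = json.loads(base64.b64decode(world_info))
print(decoded)  # {'gpu-2': [6, 7]}
```

This matches the log's `WORLD INFO DICT: {'gpu-2': [6, 7]}` and the subsequent `Setting CUDA_VISIBLE_DEVICES=6,7`.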