THUDM / ChatGLM-6B

ChatGLM-6B: An Open Bilingual Dialogue Language Model | 开源双语对话语言模型
Apache License 2.0

DeepSpeed fine-tuning on v1.1 fails with RuntimeError: output tensor must have the same type as input tensor #1040

Open 949491819 opened 1 year ago

949491819 commented 1 year ago

Is there an existing issue for this?

Current Behavior

(pytorchzdy) [work@gpu-2 chat_generate]$ sh dp_train_glm.sh [2023-05-17 14:37:02,196] [INFO] [runner.py:299:parse_resource_filter] removing 0 from gpu-2 [2023-05-17 14:37:02,197] [INFO] [runner.py:299:parse_resource_filter] removing 1 from gpu-2 [2023-05-17 14:37:02,197] [INFO] [runner.py:299:parse_resource_filter] removing 2 from gpu-2 [2023-05-17 14:37:02,197] [INFO] [runner.py:299:parse_resource_filter] removing 3 from gpu-2 [2023-05-17 14:37:02,197] [INFO] [runner.py:299:parse_resource_filter] removing 4 from gpu-2 [2023-05-17 14:37:02,197] [INFO] [runner.py:299:parse_resource_filter] removing 5 from gpu-2 [2023-05-17 14:37:14,982] [INFO] [runner.py:454:main] Using IP address of 192.168.10.82 for node gpu-2 [2023-05-17 14:37:14,982] [INFO] [runner.py:550:main] cmd = /home/work/.conda/envs/pytorchzdy/bin/python -u -m deepspeed.launcher.launch --world_info=eyJncHUtMiI6IFs2LCA3XX0= --master_addr=192.168.10.82 --master_port=29500 --enable_each_rank_log=None dp_finetune.py --deepspeed ./config/deepspeed/ds_glm.json --model chatglm --model_path ./chatglm-6b --data_path data/instinwild_ch.json --max_datasets_size 10000 --max_len 128 --lora_rank 0 --pre_seq_len 128 --logging_steps 10 --num_train_epochs 1 --learning_rate 2e-2 --output_dir ./output/chatglm-6b --gradient_accumulation_steps 1 --per_device_train_batch_size 2 --per_device_eval_batch_size 1 --predict_with_generate --max_steps 3000 --save_steps 1000 --grad_checkpointing [2023-05-17 14:37:18,127] [INFO] [launch.py:142:main] WORLD INFO DICT: {'gpu-2': [6, 7]} [2023-05-17 14:37:18,127] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0 [2023-05-17 14:37:18,127] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'gpu-2': [0, 1]}) [2023-05-17 14:37:18,127] [INFO] [launch.py:162:main] dist_world_size=2 [2023-05-17 14:37:18,127] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=6,7 [2023-05-17 14:37:25,621] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl [INFO] [05/17/2023 14:37:28] [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 0 [INFO] [05/17/2023 14:37:28] [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:1 to store for rank: 1 [INFO] [05/17/2023 14:37:28] [torch.distributed.distributed_c10d] Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes. [INFO] [05/17/2023 14:37:28] [torch.distributed.distributed_c10d] Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes. [INFO] [05/17/2023 14:37:28] [main] Loading model, config and tokenizer ... Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision. [INFO] [05/17/2023 14:37:28] [main] Loading model, config and tokenizer ... Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision. [INFO] [05/17/2023 14:37:28] [main] Use P-Tuning v2 to fine-tune model Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision. Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision. 
[INFO] [05/17/2023 14:37:28] [main] Use P-Tuning v2 to fine-tune model Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision. Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision. [2023-05-17 14:37:41,891] [INFO] [partition_parameters.py:415:exit] finished initializing model with 6.74B parameters Loading checkpoint shards: 0%| | 0/8 [00:00<?, ?it/s]/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead. warnings.warn( /home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead. warnings.warn( Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:21<00:00, 2.69s/it] Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:21<00:00, 2.69s/it] [INFO] [05/17/2023 14:38:03] [main] Loading dataset ... [INFO] [05/17/2023 14:38:03] [dataset.data_loader] Building chaglm dataloaders [INFO] [05/17/2023 14:38:03] [dataset.chat_dataset] Loading json data: data/instinwild_ch.json [INFO] [05/17/2023 14:38:03] [main] Loading dataset ... [INFO] [05/17/2023 14:38:03] [dataset.data_loader] Building chaglm dataloaders [INFO] [05/17/2023 14:38:03] [dataset.chat_dataset] Loading json data: data/instinwild_ch.json [WARNING] [05/17/2023 14:38:04] [datasets.builder] Found cached dataset json (/home/work/.cache/huggingface/datasets/json/default-a8d4b15460af874d/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e) 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 259.85it/s] [INFO] [05/17/2023 14:38:04] [dataset.chat_dataset] Loaded 51504 examples. [INFO] [05/17/2023 14:38:04] [dataset.chat_dataset] Limiting dataset to 10000 examples. [INFO] [05/17/2023 14:38:04] [dataset.chat_dataset] Formatting ChatGLM inputs ... [WARNING] [05/17/2023 14:38:04] [datasets.builder] Found cached dataset json (/home/work/.cache/huggingface/datasets/json/default-a8d4b15460af874d/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e) 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 266.51it/s] [INFO] [05/17/2023 14:38:04] [dataset.chat_dataset] Loaded 51504 examples. [INFO] [05/17/2023 14:38:04] [dataset.chat_dataset] Limiting dataset to 10000 examples. 
[INFO] [05/17/2023 14:38:04] [dataset.chat_dataset] Formatting ChatGLM inputs ... [INFO] [05/17/2023 14:38:06] [dataset.chat_dataset] Tokenizing inputs ... Dataset: 0%| | 0/10000 [00:00<?, ?it/s][INFO] [05/17/2023 14:38:07] [dataset.chat_dataset] Tokenizing inputs ... Dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:05<00:00, 1812.93it/s] [input_ids]: [5, 64286, 12, 64157, 68896, 64185, 66731, 79046, 64230, 69551, 63823, 4, 67342, 12, 130001, 130004, 5, 64276, 67446, 66731, 79046, 64230, 69551, 78511, 66136, 71738, 6, 83710, 63878, 63979, 83428, 63871, 71738, 6, 64651, 87699, 66265, 64748, 6, 66544, 66351, 64950, 63824, 81883, 63826, 71821, 84329, 63825, 70797, 63823, 66018, 6, 63878, 74017, 64351, 97701, 63944, 80799, 64288, 64786, 63944, 66057, 64258, 6, 64529, 91081, 6, 64024, 71377, 63835, 6, 64542, 77280, 6, 64287, 66715, 63841, 64987, 65878, 86089, 6, 63899, 80732, 66372, 64700, 6, 97827, 69514, 74470, 63823, 130005, 130005, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3] [inputs] : 问:请讲解如何缓解上班族病的症状。 答: 一种有效的缓解上班族病的症状方法是出去散步,每天晚上可以花几个小时去散步,减少坐姿固定的时间,放松肩痛、腰痛和背部综合症的发作。另外,可以试着利用午休时间或者其他空余时间锻炼一下,比如慢跑,打太极拳等,帮助舒缓,运动释放时也可以练习深呼吸,这能帮助消除压力,更有利于解除病症。 [label_ids]: [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 130004, 5, 64276, 67446, 66731, 79046, 64230, 69551, 78511, 66136, 71738, 6, 83710, 63878, 63979, 83428, 63871, 71738, 6, 64651, 87699, 66265, 64748, 6, 66544, 66351, 64950, 63824, 81883, 63826, 71821, 84329, 63825, 70797, 63823, 66018, 6, 63878, 74017, 64351, 97701, 63944, 80799, 64288, 64786, 63944, 66057, 64258, 6, 64529, 91081, 6, 64024, 71377, 63835, 6, 64542, 77280, 6, 64287, 66715, 63841, 64987, 65878, 86089, 6, 63899, 80732, 66372, 64700, 6, 97827, 69514, 74470, 63823, 130005, 130005, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100] [labels] :

一种有效的缓解上班族病的症状方法是出去散步,每天晚上可以花几个小时去散步,减少坐姿固定的时间,放松肩痛、腰痛和背部综合症的发作。另外,可以试着利用午休时间或者其他空余时间锻炼一下,比如慢跑,打太极拳等,帮助舒缓,运动释放时也可以练习深呼吸,这能帮助消除压力,更有利于解除病症。 [INFO] [05/17/2023 14:38:12] [dataset.data_loader] Loaded 9000 training examples, 1000 evaluation examples [input_ids]: [5, 64286, 12, 64157, 64201, 73848, 70522, 71039, 70022, 71529, 12, 7457, 63824, 11329, 63824, 4218, 63824, 3802, 63824, 49241, 63823, 4, 67342, 12, 130001, 130004, 11632, 63824, 49241, 63824, 3802, 63824, 11329, 63824, 4218, 130005, 130005, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3] [inputs] : 问:请按字母顺序排列下列单词:apple、dog、tree、cat、banana。 答: apple、banana、cat、dog、tree [label_ids]: [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 130004, 11632, 63824, 49241, 63824, 3802, 63824, 11329, 63824, 4218, 130005, 130005, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100] [labels ]: apple、banana、cat、dog、tree [INFO] [05/17/2023 14:38:12] [__main__] Start to train ... [INFO] [05/17/2023 14:38:12] [__main__] Training argments: Seq2SeqTrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=False, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=./config/deepspeed/ds_glm.json, disable_tqdm=False, do_eval=False, do_predict=False, do_train=False, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generation_config=None, generation_max_length=None, generation_num_beams=None, gradient_accumulation_steps=1, gradient_checkpointing=False, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=0.02, length_column_name=length, load_best_model_at_end=False, local_rank=1, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=./output/chatglm-6b/runs/May17_14-37-25_gpu-2, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=10, logging_strategy=steps, lr_scheduler_type=linear, max_grad_norm=1.0, max_steps=3000, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=1.0, optim=adamw_hf, optim_args=None, output_dir=./output/chatglm-6b, 
overwrite_output_dir=False, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=2, predict_with_generate=True, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard', 'wandb'], resume_from_checkpoint=None, run_name=./output/chatglm-6b, save_on_each_node=False, save_safetensors=False, save_steps=1000, save_strategy=steps, save_total_limit=None, seed=42, sharded_ddp=[], skip_memory_metrics=True, sortish_sampler=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, xpu_backend=None, ) [INFO] [05/17/2023 14:38:12] [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:2 to store for rank: 1 Dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:05<00:00, 1778.27it/s] [input_ids]: [5, 64286, 12, 64157, 68896, 64185, 66731, 79046, 64230, 69551, 63823, 4, 67342, 12, 130001, 130004, 5, 64276, 67446, 66731, 79046, 64230, 69551, 78511, 66136, 71738, 6, 83710, 63878, 63979, 83428, 63871, 71738, 6, 64651, 87699, 66265, 64748, 6, 66544, 66351, 64950, 63824, 81883, 63826, 71821, 84329, 63825, 70797, 63823, 66018, 6, 63878, 74017, 64351, 97701, 63944, 80799, 64288, 64786, 63944, 66057, 64258, 6, 64529, 91081, 6, 64024, 71377, 63835, 6, 64542, 77280, 6, 64287, 66715, 63841, 64987, 65878, 86089, 6, 63899, 80732, 66372, 64700, 6, 97827, 69514, 74470, 63823, 130005, 130005, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3] [inputs] : 问:请讲解如何缓解上班族病的症状。 答: 一种有效的缓解上班族病的症状方法是出去散步,每天晚上可以花几个小时去散步,减少坐姿固定的时间,放松肩痛、腰痛和背部综合症的发作。另外,可以试着利用午休时间或者其他空余时间锻炼一下,比如慢跑,打太极拳等,帮助舒缓,运动释放时也可以练习深呼吸,这能帮助消除压力,更有利于解除病症。 [label_ids]: [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 130004, 5, 64276, 67446, 66731, 79046, 64230, 69551, 78511, 66136, 71738, 6, 83710, 63878, 63979, 83428, 63871, 71738, 6, 64651, 87699, 66265, 64748, 6, 66544, 66351, 64950, 63824, 81883, 63826, 71821, 84329, 63825, 70797, 63823, 66018, 6, 63878, 74017, 64351, 97701, 63944, 80799, 64288, 64786, 63944, 66057, 64258, 6, 64529, 91081, 6, 64024, 71377, 63835, 6, 64542, 77280, 6, 64287, 66715, 63841, 64987, 65878, 86089, 6, 63899, 80732, 66372, 64700, 6, 97827, 69514, 74470, 63823, 130005, 130005, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100] [labels] : 一种有效的缓解上班族病的症状方法是出去散步,每天晚上可以花几个小时去散步,减少坐姿固定的时间,放松肩痛、腰痛和背部综合症的发作。另外,可以试着利用午休时间或者其他空余时间锻炼一下,比如慢跑,打太极拳等,帮助舒缓,运动释放时也可以练习深呼吸,这能帮助消除压力,更有利于解除病症。 [INFO] [05/17/2023 14:38:12] [dataset.data_loader] Loaded 9000 training examples, 1000 evaluation examples [input_ids]: [5, 64286, 12, 64157, 64201, 73848, 70522, 71039, 70022, 71529, 12, 7457, 63824, 11329, 63824, 4218, 63824, 3802, 63824, 49241, 63823, 4, 67342, 12, 130001, 130004, 11632, 63824, 49241, 63824, 3802, 63824, 11329, 63824, 4218, 130005, 130005, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3] [inputs] : 问:请按字母顺序排列下列单词:apple、dog、tree、cat、banana。 答: apple、banana、cat、dog、tree [label_ids]: [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 130004, 11632, 63824, 49241, 63824, 3802, 63824, 11329, 63824, 4218, 130005, 130005, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100] [labels ]: apple、banana、cat、dog、tree [INFO] [05/17/2023 14:38:12] [__main__] Start to train ... [INFO] [05/17/2023 14:38:12] [__main__] Training argments: Seq2SeqTrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=False, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=./config/deepspeed/ds_glm.json, disable_tqdm=False, do_eval=False, do_predict=False, do_train=False, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generation_config=None, generation_max_length=None, generation_num_beams=None, gradient_accumulation_steps=1, gradient_checkpointing=False, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=0.02, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=./output/chatglm-6b/runs/May17_14-37-25_gpu-2, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=10, logging_strategy=steps, lr_scheduler_type=linear, max_grad_norm=1.0, max_steps=3000, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=1.0, optim=adamw_hf, optim_args=None, output_dir=./output/chatglm-6b, overwrite_output_dir=False, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=2, predict_with_generate=True, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard', 'wandb'], resume_from_checkpoint=None, run_name=./output/chatglm-6b, save_on_each_node=False, save_safetensors=False, save_steps=1000, save_strategy=steps, save_total_limit=None, seed=42, sharded_ddp=[], skip_memory_metrics=True, 
sortish_sampler=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, xpu_backend=None, ) [INFO] [05/17/2023 14:38:12] [torch.distributed.distributed_c10d] Added key: store_based_barrier_key:2 to store for rank: 0 [INFO] [05/17/2023 14:38:12] [torch.distributed.distributed_c10d] Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes. [INFO] [05/17/2023 14:38:12] [torch.distributed.distributed_c10d] Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes. Using /home/work/.cache/torch_extensions/py310_cu116 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /home/work/.cache/torch_extensions/py310_cu116/cpu_adam/build.ninja... Building extension module cpu_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module cpu_adam... Time to load cpu_adam op: 0.46748924255371094 seconds Using /home/work/.cache/torch_extensions/py310_cu116 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /home/work/.cache/torch_extensions/py310_cu116/cpu_adam/build.ninja... Building extension module cpu_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module cpu_adam... Time to load cpu_adam op: 0.444105863571167 seconds Using /home/work/.cache/torch_extensions/py310_cu116 as PyTorch extensions root... Emitting ninja build file /home/work/.cache/torch_extensions/py310_cu116/utils/build.ninja... Building extension module utils... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module utils... Time to load utils op: 0.020650863647460938 seconds Using /home/work/.cache/torch_extensions/py310_cu116 as PyTorch extensions root... Emitting ninja build file /home/work/.cache/torch_extensions/py310_cu116/utils/build.ninja... Building extension module utils... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module utils... Time to load utils op: 0.01944255828857422 seconds Parameter Offload: Total persistent parameters: 1499136 in 226 params Using /home/work/.cache/torch_extensions/py310_cu116 as PyTorch extensions root... No modifications detected for re-loaded extension module utils, skipping build step... Loading extension module utils... Time to load utils op: 0.0006818771362304688 seconds /home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2387: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead. 
warnings.warn(
Traceback (most recent call last):
  File "/home/work/liuwc/chat_generate/dp_finetune.py", line 321, in <module>
    train(args, training_args)
  File "/home/work/liuwc/chat_generate/dp_finetune.py", line 200, in train
    train_results = trainer.train()
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/transformers/trainer.py", line 2699, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/transformers/trainer.py", line 2731, in compute_loss
    outputs = model(**inputs)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/work/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 1190, in forward
    transformer_outputs = self.transformer(
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/work/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 926, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1201, in _call_impl
    result = hook(self, input)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 348, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 478, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 258, in fetch_sub_module
    self.__all_gather_params(params_to_fetch)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 399, in __all_gather_params
    handle = partitioned_params[0].all_gather_coalesced(partitioned_params)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 848, in all_gather_coalesced
    handle = _dist_allgather_fn(
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 43, in _dist_allgather_fn
    return instrument_w_nvtx(dist.allgather_fn)(output_tensor,
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 332, in allgather_fn
    return all_gather_base(output_tensor,
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 123, in log_wrapper
    return func(*args, **kwargs)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 311, in all_gather_base
    return cdb.all_gather_base(output_tensor=output_tensor,
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 91, in all_gather_base
    return torch.distributed.distributed_c10d._all_gather_base(
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2392, in _all_gather_base
    return all_gather_into_tensor(output_tensor, input_tensor, group, async_op)
  File "/home/work/.conda/envs/pytorchzdy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2358, in all_gather_into_tensor
    work = group._allgather_base(output_tensor, input_tensor)
RuntimeError: output tensor must have the same type as input tensor
Using /home/work/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0003590583801269531 seconds
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice:
[2023-05-17 14:38:20,199] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 31043
[2023-05-17 14:38:22,921] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 31044
[2023-05-17 14:38:22,922] [ERROR] [launch.py:324:sigkill_handler] ['/home/work/.conda/envs/pytorchzdy/bin/python', '-u', 'dp_finetune.py', '--local_rank=1', '--deepspeed', './config/deepspeed/ds_glm.json', '--model', 'chatglm', '--model_path', './chatglm-6b', '--data_path', 'data/instinwild_ch.json', '--max_datasets_size', '10000', '--max_len', '128', '--lora_rank', '0', '--pre_seq_len', '128', '--logging_steps', '10', '--num_train_epochs', '1', '--learning_rate', '2e-2', '--output_dir', './output/chatglm-6b', '--gradient_accumulation_steps', '1', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '1', '--predict_with_generate', '--max_steps', '3000', '--save_steps', '1000', '--grad_checkpointing'] exits with return code = 1
(pytorchzdy) [work@gpu-2 chat_generate]$

Expected Behavior

The fine-tuning run completes normally.

Steps To Reproduce

1. Used P-Tuning.
2. Built the trainer as follows:

   ```python
   trainer = Seq2SeqTrainer(
       model=model,
       args=training_args,
       train_dataset=train_dataset,
       eval_dataset=valid_dataset,
       tokenizer=tokenizer,
       data_collator=data_collator,
       compute_metrics=None
   )
   ```

3. Calling trainer.train() raises the error above.

Environment

```markdown
PyTorch version: 1.13.1
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 6.1.0
Clang version: Could not collect
CMake version: version 3.26.1
Libc version: glibc-2.17

Python version: 3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-862.6.3.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 11.6.55
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce GTX 1080 Ti
GPU 1: NVIDIA GeForce GTX 1080 Ti
GPU 2: NVIDIA GeForce GTX 1080 Ti
GPU 3: NVIDIA GeForce GTX 1080 Ti
GPU 4: NVIDIA GeForce GTX 1080 Ti
GPU 5: NVIDIA GeForce GTX 1080 Ti
GPU 6: NVIDIA GeForce GTX 1080 Ti
GPU 7: NVIDIA GeForce GTX 1080 Ti

Nvidia driver version: 510.85.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.5
[pip3] pytorch-accelerated==0.1.45
[pip3] pytorch-pretrained-bert==0.6.2
[pip3] torch==1.13.1
[pip3] torchaudio==0.13.1
[pip3] torchvision==0.14.1
[conda] blas 1.0 mkl
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py310h7f8727e_0
[conda] mkl_fft 1.3.1 py310hd6ae3a3_0
[conda] mkl_random 1.2.2 py310h00e6091_0
[conda] numpy 1.23.5 py310hd5efca6_0
[conda] numpy-base 1.23.5 py310h8e6c178_0
[conda] pytorch 1.13.1 py3.10_cuda11.6_cudnn8.3.2_0 pytorch
[conda] pytorch-accelerated 0.1.45 pypi_0 pypi
[conda] pytorch-cuda 11.6 h867d48c_1 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch-pretrained-bert 0.6.2 pypi_0 pypi
[conda] torchaudio 0.13.1 py310_cu116 pytorch
[conda] torchvision 0.14.1 py310_cu116 pytorch
```

Anything else?

The DeepSpeed configuration used (./config/deepspeed/ds_glm.json):

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "zero_allow_untested_optimizer": true,
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```
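The error is raised while ZeRO-3 all-gathers a partitioned parameter into an output buffer of a different dtype. A small diagnostic that may help narrow this down is to print the dtypes present in the model before trainer.train() is called. This is a sketch, not part of the original dp_finetune.py; it assumes `model` is the same object passed to Seq2SeqTrainer above:

```python
# Diagnostic sketch (assumption: the failure is a parameter-dtype mismatch,
# e.g. fp16 checkpoint weights next to a float32 P-Tuning v2 prefix encoder).
from collections import Counter

def summarize_param_dtypes(model):
    """Count parameter tensors per dtype so an fp16/fp32 mix is easy to spot."""
    counts = Counter(str(p.dtype) for p in model.parameters())
    for dtype, n in sorted(counts.items()):
        print(f"{dtype}: {n} parameter tensors")
    return counts

# If this prints more than one dtype, a single all-gather buffer cannot match
# every parameter, which is consistent with the RuntimeError above.
summarize_param_dtypes(model)
```

Note that with "fp16": {"enabled": "auto"} the DeepSpeed setting follows the Trainer's fp16 flag, which is False in the arguments dump above, so DeepSpeed runs in fp32 while the published ChatGLM-6B checkpoint weights are typically fp16.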
DogeWatch commented 1 year ago

I have the same problem, https://github.com/microsoft/DeepSpeed/issues/3654
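For reference, one direction that addresses this class of mismatch is to keep the model weights and the DeepSpeed/Trainer precision settings consistent, so ZeRO-3 gathers parameters into a buffer of the same dtype. The following is a minimal sketch, not a fix confirmed in this thread; it assumes the training arguments are built in Python rather than parsed from the command line, and that the whole model (including any prefix encoder) is run in fp16:

```python
# Sketch: run fp16 end to end so the ZeRO-3 all-gather buffer matches the weights.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./output/chatglm-6b",
    deepspeed="./config/deepspeed/ds_glm.json",
    fp16=True,                     # lets "fp16": {"enabled": "auto"} resolve to enabled
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    learning_rate=2e-2,
    max_steps=3000,
    save_steps=1000,
    predict_with_generate=True,
)

# Cast every parameter (including any P-Tuning prefix encoder added on top of
# the fp16 checkpoint) to the same dtype before handing the model to the trainer.
model = model.half()
```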