train.sh finished with no errors, but no model files were generated in the output directory. What is going on?

lilongwei5054 commented 1 year ago

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

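train.sh ran to completion and I saw no errors, but no model files were generated in the output directory. I tried several times and the result was always the same.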

Expected Behavior

No response

Steps To Reproduce

(base) root@ThinkStation-K-C2:/home/pypro/chatglm6b/ChatGLM-6B/ptuning# bash train.sh 2023-06-03 10:13:35.134319: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2023-06-03 10:13:35.207116: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0. 06/03/2023 10:13:36 - WARNING - main - Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False 06/03/2023 10:13:36 - INFO - main - Training/evaluation parameters Seq2SeqTrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=False, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generation_max_length=None, generation_num_beams=None, gradient_accumulation_steps=16, gradient_checkpointing=False, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, 
include_inputs_for_metrics=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=0.02, length_column_name=length, load_best_model_at_end=False, local_rank=-1, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=cus/out/checkpoint20/runs/Jun03_10-13-36_ThinkStation-K-C2, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=10, logging_strategy=steps, lr_scheduler_type=linear, max_grad_norm=1.0, max_steps=100, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=3.0, optim=adamw_hf, optim_args=None, output_dir=cus/out/checkpoint20, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=1, predict_with_generate=True, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard', 'wandb'], resume_from_checkpoint=None, run_name=cus/out/checkpoint20, save_on_each_node=False, save_steps=1000, save_strategy=steps, save_total_limit=None, seed=42, sharded_ddp=[], skip_memory_metrics=True, sortish_sampler=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, xpu_backend=None, ) Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-0713b568e641e634/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4... 
Downloading data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 4801.72it/s] Extracting data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 877.74it/s] Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-0713b568e641e634/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data. 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 405.27it/s] [INFO|configuration_utils.py:666] 2023-06-03 10:13:37,854 >> loading configuration file /home/pypro/chatglm6b/ChatGLM-6B/THUDM/chatglm-6b/config.json [WARNING|configuration_auto.py:905] 2023-06-03 10:13:37,854 >> Explicitly passing a revision is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision. 
[INFO|configuration_utils.py:666] 2023-06-03 10:13:37,891 >> loading configuration file /home/pypro/chatglm6b/ChatGLM-6B/THUDM/chatglm-6b/config.json [INFO|configuration_utils.py:720] 2023-06-03 10:13:37,892 >> Model config ChatGLMConfig { "_name_or_path": "/home/pypro/chatglm6b/ChatGLM-6B/THUDM/chatglm-6b", "architectures": [ "ChatGLMModel" ], "auto_map": { "AutoConfig": "configuration_chatglm.ChatGLMConfig", "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration", "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration" }, "bos_token_id": 130004, "eos_token_id": 130005, "gmask_token_id": 130001, "hidden_size": 4096, "inner_hidden_size": 16384, "layernorm_epsilon": 1e-05, "mask_token_id": 130000, "max_sequence_length": 2048, "model_type": "chatglm", "num_attention_heads": 32, "num_layers": 28, "pad_token_id": 3, "position_encoding_2d": true, "pre_seq_len": null, "prefix_projection": false, "quantization_bit": 0, "torch_dtype": "float16", "transformers_version": "4.27.1", "use_cache": true, "vocab_size": 130528 }

[WARNING|tokenization_auto.py:652] 2023-06-03 10:13:37,892 >> Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision. [INFO|tokenization_utils_base.py:1800] 2023-06-03 10:13:37,915 >> loading file ice_text.model [INFO|tokenization_utils_base.py:1800] 2023-06-03 10:13:37,915 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:1800] 2023-06-03 10:13:37,915 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:1800] 2023-06-03 10:13:37,915 >> loading file tokenizer_config.json [WARNING|auto_factory.py:456] 2023-06-03 10:13:38,022 >> Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision. [INFO|modeling_utils.py:2400] 2023-06-03 10:13:38,061 >> loading weights file /home/pypro/chatglm6b/ChatGLM-6B/THUDM/chatglm-6b/pytorch_model.bin.index.json [INFO|configuration_utils.py:575] 2023-06-03 10:13:38,061 >> Generate config GenerationConfig { "_from_model_config": true, "bos_token_id": 130004, "eos_token_id": 130005, "pad_token_id": 3, "transformers_version": "4.27.1" }

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:06<00:00, 1.16it/s] [INFO|modeling_utils.py:3032] 2023-06-03 10:13:45,136 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.

[WARNING|modeling_utils.py:3034] 2023-06-03 10:13:45,136 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at /home/pypro/chatglm6b/ChatGLM-6B/THUDM/chatglm-6b and are newly initialized: ['transformer.prefix_encoder.embedding.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. [INFO|modeling_utils.py:2690] 2023-06-03 10:13:45,159 >> Generation config file not found, using a generation config created from the model config. Quantized to 4 bit input_ids [53, 6945, 5, 8, 42, 4, 64286, 12, 65309, 64013, 63826, 86664, 64523, 69773, 4, 67342, 12, 130001, 130004, 5, 72176, 64095, 64685, 6, 86664, 71073, 66201, 63920, 63826, 85759, 6, 63856, 66461, 67464, 66201, 81363, 76247, 89580, 6, 71592, 58, 65925, 69367, 63825, 94584, 125779, 78495, 9, 18, 9, 9, 63830, 108750, 63842, 6, 64823, 63828, 66116, 67464, 66201, 6, 67494, 66116, 65320, 63826, 67464, 6, 69548, 109995, 6, 5, 86664, 64829, 63903, 64710, 67407, 6, 63829, 103615, 65250, 63825, 101004, 6, 64601, 68231, 58, 65309, 64013, 67407, 125779, 64057, 58, 63847, 63883, 65925, 71057, 95082, 117928, 63824, 63847, 63883, 68579, 71057, 95082, 64710, 64699, 67407, 6, 63847, 63883, 65925, 71057, 95082, 70040, 67407, 125779, 64601, 71073, 87853, 93893, 63826, 84772, 6, 66461, 67464, 74029, 82882, 72439, 64799, 66759, 130005, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3] inputs [Round 0] 问:孙文和孙中山那个伟大答: 他们是同一个人,孙中山是一位革命家和政治家,他在中国民主革命的道路上起到了重要作用,被誉为“民族解放的先驱”, 他在1911年辛亥革命后,领导了中国的民主革命,实现了中国的统一和民主,建立了中华民国, 孙中山提出三民主义,是中国现代政治的理论基础, 他提出了“孙文主义”,即“以中国民族的利益为中心的民族主义、以中国民众的利益为中心的民权主义,以中国民族的利益为中心的民生主义”, 他是一位杰出的思想家和教育家,在中国民主教育的发展中做出了重要贡献 labelids [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 130004, 5, 72176, 64095, 64685, 6, 86664, 71073, 66201, 63920, 63826, 85759, 6, 63856, 66461, 67464, 66201, 81363, 76247, 89580, 6, 71592, 58, 65925, 69367, 63825, 94584, 125779, 78495, 9, 18, 9, 9, 63830, 108750, 63842, 6, 64823, 63828, 66116, 67464, 66201, 6, 67494, 66116, 65320, 63826, 67464, 6, 69548, 109995, 6, 5, 86664, 64829, 63903, 64710, 67407, 6, 63829, 103615, 65250, 63825, 101004, 6, 64601, 68231, 58, 65309, 64013, 67407, 125779, 64057, 58, 63847, 63883, 65925, 71057, 95082, 117928, 63824, 63847, 63883, 68579, 71057, 95082, 64710, 64699, 67407, 6, 63847, 63883, 65925, 71057, 95082, 70040, 67407, 125779, 64601, 71073, 87853, 93893, 63826, 84772, 6, 66461, 67464, 74029, 82882, 72439, 64799, 66759, 130005, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100] labels <image-100> 
他们是同一个人,孙中山是一位革命家和政治家,他在中国民主革命的道路上起到了重要作用,被誉为“民族解放的先驱”, 他在1911年辛亥革命后,领导了中国的民主革命,实现了中国的统一和民主,建立了中华民国, 孙中山提出三民主义,是中国现代政治的理论基础, 他提出了“孙文主义”,即“以中国民族的利益为中心的民族主义、以中国民众的利益为中心的民权主义,以中国民族的利益为中心的民生主义”, 他是一位杰出的思想家和教育家,在中国民主教育的发展中做出了重要贡献 /usr/local/software/anaconda/install/lib/python3.9/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning warnings.warn( [INFO|integrations.py:709] 2023-06-03 10:14:34,414 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true" wandb: Tracking run with wandb version 0.15.2 wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing. 0%| | 0/100 [00:00<?, ?it/s]06/03/2023 10:14:39 - WARNING - transformers_modules.chatglm-6b.modeling_chatglm - use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False... {'loss': 1.0515, 'learning_rate': 0.018000000000000002, 'epoch': 10.0}
{'loss': 0.1403, 'learning_rate': 0.016, 'epoch': 20.0}
{'loss': 0.011, 'learning_rate': 0.013999999999999999, 'epoch': 30.0}
{'loss': 0.0041, 'learning_rate': 0.012, 'epoch': 40.0}
{'loss': 0.0033, 'learning_rate': 0.01, 'epoch': 50.0}
{'loss': 0.0028, 'learning_rate': 0.008, 'epoch': 60.0}
{'loss': 0.0025, 'learning_rate': 0.006, 'epoch': 70.0}
{'loss': 0.002, 'learning_rate': 0.004, 'epoch': 80.0}
{'loss': 0.0026, 'learning_rate': 0.002, 'epoch': 90.0}
{'loss': 0.0023, 'learning_rate': 0.0, 'epoch': 100.0}
{'train_runtime': 740.7429, 'train_samples_per_second': 2.16, 'train_steps_per_second': 0.135, 'train_loss': 0.12223636329174042, 'epoch': 100.0}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [12:15<00:00, 7.36s/it] train metrics epoch = 100.0 train_loss = 0.1222 train_runtime = 0:12:20.74 train_samples = 8 train_samples_per_second = 2.16 train_steps_per_second = 0.135 wandb: Waiting for W&B process to finish... (success). wandb: wandb: Run history: wandb: train/epoch ▁▂▃▃▄▅▆▆▇██ wandb: train/global_step ▁▂▃▃▄▅▆▆▇██ wandb: train/learning_rate █▇▆▆▅▄▃▃▂▁ wandb: train/loss █▂▁▁▁▁▁▁▁▁ wandb: train/total_flos ▁ wandb: train/train_loss ▁ wandb: train/train_runtime ▁ wandb: train/train_samples_per_second ▁ wandb: train/train_steps_per_second ▁ wandb: wandb: Run summary: wandb: train/epoch 100.0 wandb: train/global_step 100 wandb: train/learning_rate 0.0 wandb: train/loss 0.0023 wandb: train/total_flos 6933144246681600.0 wandb: train/train_loss 0.12224 wandb: train/train_runtime 740.7429 wandb: train/train_samples_per_second 2.16 wandb: train/train_steps_per_second 0.135 wandb: wandb: You can sync this run to the cloud by running: wandb: wandb sync /home/pypro/chatglm6b/ChatGLM-6B/ptuning/wandb/offline-run-20230603_101435-chuqiuv6 wandb: Find logs at: ./wandb/offline-run-20230603_101435-chuqiuv6/logs (base) root@ThinkStation-K-C2:/home/pypro/chatglm6b/ChatGLM-6B/ptuning#

Environment

- OS:ubuntu20
- Python:3.9
- Transformers:4.27.1
- PyTorch:2.0
- CUDA Support (`True`) :

Anything else?

Single GPU: V100 32G

tianlichunhong commented 1 year ago

If training works normally, the output is written to the ptuning\output\adgen-chatglm-6b-pt-128-2e-2\ directory, which contains folders such as checkpoint-3000.

lilongwei5054 commented 1 year ago

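@tianlichunhong I have now found that running train_chat.sh produces new model files, while train.sh does not. It is probably related to my JSON data file: my data is in Q&A form, for example {"prompt":"123","response":"abc"}, whereas train.sh presumably expects the data format provided on the author's network drive.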
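One way to test the data-format theory is to check that every record actually contains the two columns the training script is told to read. The sketch below is only an illustration: it assumes the data file holds one JSON object per line, `missing_columns` is a hypothetical helper, and the `content`/`summary` pair stands in for whatever column names the unmodified train.sh passes (an assumption, so check your own script).

```python
import json
from typing import Iterable


def missing_columns(
    lines: Iterable[str], prompt_column: str, response_column: str
) -> list:
    """Return (line_number, missing_keys) for every record lacking a column.

    Hypothetical helper: assumes one JSON object per non-empty line.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        missing = [c for c in (prompt_column, response_column) if c not in record]
        if missing:
            problems.append((number, missing))
    return problems


# The Q&A-style record from this thread.
sample = ['{"prompt": "123", "response": "abc"}']

# Column names that match the data: nothing is reported.
print(missing_columns(sample, "prompt", "response"))

# Column names assumed for the stock script: both are reported as missing.
print(missing_columns(sample, "content", "summary"))
```

To check a real file, pass an open file handle instead of the `sample` list.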

Vincent-Huang-2000 commented 1 year ago

Your save_steps is 1000 but your max_steps is 100, so no checkpoint was ever saved... Also, if your data format is {"prompt":"123","response":"abc"}, you need to change prompt_column and response_column in train.sh 😢😢 Don't ask how I know: I forgot to save mine just now too...
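The interaction between the two settings can be sketched as simple arithmetic. This is a minimal illustration, assuming step-based saving fires whenever the global step is a multiple of save_steps and that nothing else writes the model at the end of training; `checkpoint_steps` is a hypothetical helper, not part of the training script.

```python
def checkpoint_steps(max_steps: int, save_steps: int) -> list:
    """List the global steps at which a step-based save would fire.

    Assumption: a checkpoint is written when global_step % save_steps == 0,
    and training stops at max_steps.
    """
    if max_steps <= 0 or save_steps <= 0:
        raise ValueError("max_steps and save_steps must be positive")
    return list(range(save_steps, max_steps + 1, save_steps))


# Values from the log above: training ends before the first save point.
print(checkpoint_steps(max_steps=100, save_steps=1000))  # []

# Lowering save_steps to at most max_steps yields at least one checkpoint.
print(checkpoint_steps(max_steps=100, save_steps=50))  # [50, 100]
```

Under that assumption, any save_steps no larger than max_steps leaves at least one checkpoint-N folder in output_dir.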

fthjane commented 1 year ago

@Vincent-Huang-2000 Do I need to set the resume_from_checkpoint parameter when running p-tuning, and if so, how should I set it? When I run it, I get:
  706     if not is_initialized():
❱ 707         raise RuntimeError(
  708             "Default process group has not been initialized, "
  709             "please make sure to call init_process_group."
  710         )
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
How can I fix this?

fthjane commented 1 year ago

Setting resume_from_checkpoint to the path of the downloaded default model does not work either. It reports:
RuntimeError: Error(s) in loading state_dict for ChatGLMForConditionalGeneration:
size mismatch for transformer.layers.25.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([16384, 4096]) from checkpoint, the shape in current model is torch.Size([16384, 2048]).
size mismatch for transformer.layers.25.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([4096, 16384]) from checkpoint, the shape in current model is torch.Size([4096, 8192]).
size mismatch for transformer.layers.26.attention.query_key_value.weight: copying a param with shape torch.Size([12288, 4096]) from checkpoint, the shape in current model is torch.Size([12288, 2048]).
size mismatch for transformer.layers.26.attention.dense.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([4096, 2048]).
size mismatch for transformer.layers.26.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([16384, 4096]) from checkpoint, the shape in current model is torch.Size([16384, 2048]).
size mismatch for transformer.layers.26.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([4096, 16384]) from checkpoint, the shape in current model is torch.Size([4096, 8192]).
size mismatch for transformer.layers.27.attention.query_key_value.weight: copying a param with shape torch.Size([12288, 4096]) from checkpoint, the shape in current model is torch.Size([12288, 2048]).
size mismatch for transformer.layers.27.attention.dense.weight: copying a param with shape torch.Size([4096, 4096]) from checkpoint, the shape in current model is torch.Size([4096, 2048]).
size mismatch for transformer.layers.27.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([16384, 4096]) from checkpoint, the shape in current model is torch.Size([16384, 2048]).
size mismatch for transformer.layers.27.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([4096, 16384]) from checkpoint, the shape in current model is torch.Size([4096, 8192]).
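One observation about these shapes, offered as a reading rather than a confirmed diagnosis: in every mismatch the current model's last dimension is exactly half of the checkpoint's. That is what you would see if the in-memory model had its linear weights packed to 4 bits while the checkpoint holds unquantized weights. The sketch below assumes a packing rule in which weights are stored as int8 and the last dimension scales by bits/8; `packed_shape` is a hypothetical helper used only to check the arithmetic.

```python
def packed_shape(out_features: int, in_features: int, bits: int) -> tuple:
    """Shape of a weight matrix after packing `bits`-bit values into int8.

    Assumed rule: rows are unchanged, columns scale by bits / 8.
    """
    if bits not in (4, 8):
        raise ValueError("only 4-bit and 8-bit packing are considered here")
    return (out_features, in_features * bits // 8)


# Checkpoint shapes taken from the error message above.
checkpoint_shapes = {
    "mlp.dense_h_to_4h": (16384, 4096),
    "mlp.dense_4h_to_h": (4096, 16384),
    "attention.query_key_value": (12288, 4096),
    "attention.dense": (4096, 4096),
}

for name, (rows, cols) in checkpoint_shapes.items():
    print(name, (rows, cols), "->", packed_shape(rows, cols, bits=4))
```

Under that assumption the 4-bit results match the "shape in current model" values reported in the error.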

Vincent-Huang-2000 commented 1 year ago

I have never set resume_from_checkpoint; just leave it at the default. I have not run into the second problem you describe.

fthjane commented 1 year ago

Thanks~ This problem is solved. It was a transformers version issue; switching to the required version fixed it.

TommyWongww commented 1 year ago

Which version did you change it to?

fthjane commented 1 year ago

Mine is 4.28.
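For anyone comparing against that number, a quick way to see which version is installed in the active environment is sketched below. It uses only the standard library; `installed_version` and `matches_release` are hypothetical helper names, and treating "4.28" as a release prefix (so 4.28.0 and 4.28.1 both count) is an assumption about what was meant.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_release(version: Optional[str], wanted: str) -> bool:
    """True if `version` equals `wanted` or is a patch release of it."""
    if version is None:
        return False
    return version == wanted or version.startswith(wanted + ".")


found = installed_version("transformers")
print("transformers:", found)
print("matches 4.28:", matches_release(found, "4.28"))
```

The environment in the original report lists Transformers 4.27.1, which this check would report as not matching.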

THUDM / ChatGLM-6B