THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs | 开源多语言多模态对话模型
Apache License 2.0

sft.yaml raises RuntimeError: 'weight' must be 2-D #271

Closed zzx528 closed 2 months ago

zzx528 commented 2 months ago

System Info

accelerate 0.31.0 aiofiles 23.2.1 aiohttp 3.9.5 aiosignal 1.3.1 altair 5.3.0 annotated-types 0.7.0 anyio 4.4.0 async-timeout 4.0.3 attrs 23.2.0 bitsandbytes 0.43.1 certifi 2024.6.2 charset-normalizer 3.3.2 click 8.1.7 contourpy 1.2.1 cycler 0.12.1 datasets 2.20.0 deepspeed 0.14.4 dill 0.3.8 distro 1.9.0 dnspython 2.6.1 einops 0.8.0 email_validator 2.2.0 exceptiongroup 1.2.1 fastapi 0.111.0 fastapi-cli 0.0.4 ffmpy 0.3.2 filelock 3.15.4 fonttools 4.53.0 frozenlist 1.4.1 fsspec 2024.5.0 gradio 4.37.2 gradio_client 1.0.2 h11 0.14.0 hjson 3.1.0 httpcore 1.0.5 httptools 0.6.1 httpx 0.27.0 huggingface-hub 0.23.4 idna 3.7 importlib_resources 6.4.0 jieba 0.42.1 Jinja2 3.1.4 joblib 1.4.2 jsonschema 4.22.0 jsonschema-specifications 2023.12.1 kiwisolver 1.4.5 markdown-it-py 3.0.0 MarkupSafe 2.1.5 matplotlib 3.9.0 mdurl 0.1.2 mpi4py 3.1.4 mpmath 1.3.0 multidict 6.0.5 multiprocess 0.70.16 networkx 3.3 ninja 1.11.1.1 nltk 3.8.1 numpy 2.0.0 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.555.43 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.5.40 nvidia-nvtx-cu12 12.1.105 openai 1.35.7 orjson 3.10.5 packaging 24.1 pandas 2.2.2 peft 0.11.1 pillow 10.3.0 pip 24.0 psutil 6.0.0 py-cpuinfo 9.0.0 pyarrow 16.1.0 pyarrow-hotfix 0.6 pydantic 2.7.4 pydantic_core 2.18.4 pydub 0.25.1 Pygments 2.18.0 pyparsing 3.1.2 python-dateutil 2.9.0.post0 python-dotenv 1.0.1 python-multipart 0.0.9 pytz 2024.1 PyYAML 6.0.1 referencing 0.35.1 regex 2024.5.15 requests 2.32.3 rich 13.7.1 rouge-chinese 1.0.3 rpds-py 0.18.1 ruamel.yaml 0.18.6 ruamel.yaml.clib 0.2.8 ruff 0.5.0 safetensors 0.4.3 scikit-learn 1.5.0 scipy 1.14.0 semantic-version 2.10.0 sentence-transformers 3.0.1 sentencepiece 0.2.0 setuptools 69.5.1 shellingham 1.5.4 six 1.16.0 sniffio 1.3.1 sse-starlette 
2.1.2 starlette 0.37.2 sympy 1.12.1 threadpoolctl 3.5.0 tiktoken 0.7.0 timm 1.0.7 tokenizers 0.19.1 tomlkit 0.12.0 toolz 0.12.1 torch 2.3.1 torchvision 0.18.1 tqdm 4.66.4 transformers 4.40.0 triton 2.3.1 typer 0.12.3 typing_extensions 4.12.2 tzdata 2024.1 ujson 5.10.0 urllib3 2.2.2 uvicorn 0.30.1 uvloop 0.19.0 watchfiles 0.22.0 websockets 11.0.3 wheel 0.43.0 xxhash 3.4.1 yarl 1.9.4

Who can help?

No response

Information

Reproduction

CUDA_VISIBLE_DEVICES=0 python finetune.py /workspace/zzx/GLM-4/finetune_demo/datasets /root/.cache/huggingface/big_models/glm-4-9b-chat configs/sft.yaml

Expected behavior

The fine-tuning script runs to completion.

zRzRzRzRzRzRzR commented 2 months ago

Please post the full error message and your hardware configuration. I could not reproduce this.

zzx528 commented 2 months ago


accelerate 0.31.0 aiofiles 23.2.1 aiohttp 3.9.5 aiosignal 1.3.1 altair 5.3.0 annotated-types 0.7.0 anyio 4.4.0 asttokens 2.4.1 async-timeout 4.0.3 attrs 23.2.0 bitsandbytes 0.43.1 blinker 1.8.2 cachetools 5.3.3 certifi 2024.6.2 charset-normalizer 3.3.2 click 8.1.7 cloudpickle 3.0.0 cmake 3.29.5.1 comm 0.2.2 contourpy 1.2.1 cycler 0.12.1 dataclasses-json 0.6.7 datasets 2.20.0 debugpy 1.8.1 decorator 5.1.1 deepspeed 0.14.3 dill 0.3.8 diskcache 5.6.3 distro 1.9.0 dnspython 2.6.1 einops 0.8.0 email_validator 2.1.2 exceptiongroup 1.2.1 executing 2.0.1 fastapi 0.111.0 fastapi-cli 0.0.4 ffmpy 0.3.2 filelock 3.15.1 fonttools 4.53.0 frozenlist 1.4.1 fsspec 2024.5.0 gitdb 4.0.11 GitPython 3.1.43 gradio 4.36.1 gradio_client 1.0.1 greenlet 3.0.3 h11 0.14.0 hjson 3.1.0 httpcore 1.0.5 httptools 0.6.1 httpx 0.27.0 huggingface-hub 0.23.4 idna 3.7 importlib_resources 6.4.0 interegular 0.3.3 ipykernel 6.29.4 ipython 8.25.0 jedi 0.19.1 jieba 0.42.1 Jinja2 3.1.4 joblib 1.4.2 jsonpatch 1.33 jsonpointer 3.0.0 jsonschema 4.22.0 jsonschema-specifications 2023.12.1 jupyter_client 8.6.2 jupyter_core 5.7.2 kiwisolver 1.4.5 langchain 0.2.5 langchain-community 0.2.5 langchain-core 0.2.8 langchain-text-splitters 0.2.1 langsmith 0.1.78 lark 1.1.9 llvmlite 0.43.0 lm-format-enforcer 0.10.1 lxml 5.2.2 markdown-it-py 3.0.0 MarkupSafe 2.1.5 marshmallow 3.21.3 matplotlib 3.9.0 matplotlib-inline 0.1.7 mdurl 0.1.2 mpi4py 3.1.4 mpmath 1.3.0 msgpack 1.0.8 multidict 6.0.5 multiprocess 0.70.16 mypy-extensions 1.0.0 nest-asyncio 1.6.0 networkx 3.3 ninja 1.11.1.1 nltk 3.8.1 numba 0.60.0 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.555.43 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.5.40 nvidia-nvtx-cu12 12.1.105 openai 1.34.0 orjson 3.10.5 outlines 
0.0.45 packaging 24.1 pandas 2.2.2 parso 0.8.4 peft 0.11.1 pexpect 4.9.0 pillow 10.3.0 pip 24.0 platformdirs 4.2.2 prometheus_client 0.20.0 prometheus-fastapi-instrumentator 7.0.0 prompt_toolkit 3.0.47 protobuf 4.25.3 psutil 5.9.8 ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyairports 2.1.1 pyarrow 16.1.0 pyarrow-hotfix 0.6 pycountry 24.6.1 pydantic 2.7.4 pydantic_core 2.18.4 pydeck 0.9.1 pydub 0.25.1 Pygments 2.18.0 PyJWT 2.8.0 PyMuPDF 1.24.5 PyMuPDFb 1.24.3 pyparsing 3.1.2 python-dateutil 2.9.0.post0 python-docx 1.1.2 python-dotenv 1.0.1 python-multipart 0.0.9 python-pptx 0.6.23 pytz 2024.1 PyYAML 6.0.1 pyzmq 26.0.3 ray 2.24.0 referencing 0.35.1 regex 2024.5.15 requests 2.32.3 rich 13.7.1 rouge-chinese 1.0.3 rpds-py 0.18.1 ruamel.yaml 0.18.6 ruamel.yaml.clib 0.2.8 ruff 0.4.9 safetensors 0.4.3 scikit-learn 1.5.0 scipy 1.13.1 semantic-version 2.10.0 sentence-transformers 3.0.1 sentencepiece 0.2.0 setuptools 69.5.1 shellingham 1.5.4 six 1.16.0 smmap 5.0.1 sniffio 1.3.1 SQLAlchemy 2.0.30 sse-starlette 2.1.2 stack-data 0.6.3 starlette 0.37.2 streamlit 1.35.0 sympy 1.12.1 tenacity 8.4.1 threadpoolctl 3.5.0 tiktoken 0.7.0 timm 1.0.3 tokenizers 0.19.1 toml 0.10.2 tomlkit 0.12.0 toolz 0.12.1 torch 2.3.0 torchvision 0.18.1 tornado 6.4.1 tqdm 4.66.4 traitlets 5.14.3 transformers 4.40.0 triton 2.3.0 typer 0.12.3 typing_extensions 4.12.2 typing-inspect 0.9.0 tzdata 2024.1 ujson 5.10.0 urllib3 2.2.1 uvicorn 0.30.1 uvloop 0.19.0 vllm 0.5.0.post1 vllm-flash-attn 2.5.9 watchdog 4.0.1 watchfiles 0.22.0 wcwidth 0.2.13 websockets 11.0.3 wheel 0.43.0 xformers 0.0.26.post1 XlsxWriter 3.2.0 xxhash 3.4.1 yarl 1.9.4 zhipuai 2.1.0.20240521

sft.yaml:

data_config:
  train_file: /workspace/zzx/GLM-4/finetune_demo/datasets/train.jsonl
  val_file: /workspace/zzx/GLM-4/finetune_demo/datasets/train.jsonl
  test_file: /workspace/zzx/GLM-4/finetune_demo/datasets/train.jsonl
  num_proc: 1
max_input_length: 512
max_output_length: 512
training_args:
  # see transformers.Seq2SeqTrainingArguments
  output_dir: ./output
  max_steps: 3000
  # needed to be fit for the dataset
  learning_rate: 5e-5
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 16
  evaluation_strategy: steps
  eval_steps: 500
  # settings for optimizer
  # adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  deepspeed: configs/ds_zero_3.json

Command: CUDA_VISIBLE_DEVICES=3 python finetune.py /workspace/zzx/GLM-4/finetune_demo/datasets /root/.cache/huggingface/big_models/glm-4-9b-chat configs/sft.yaml

Partial error log (rank0):

/opt/conda/envs/GLM4/lib/python3.10/site-packages/torch/nn/modules/module.py:1541 in _call_impl
  1538         if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks
  1539                 or _global_backward_pre_hooks or _global_backward_hooks
  1540                 or _global_forward_hooks or _global_forward_pre_hooks):
❱ 1541             return forward_call(*args, **kwargs)
  1542
  1543         try:
  1544             result = None

/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py:823 in forward
  820
  821     def forward(self, input_ids):
  822         # Embeddings.
❱ 823         words_embeddings = self.word_embeddings(input_ids)
  824         embeddings = words_embeddings
  825         # If the input flag for fp32 residual connection is set, convert for float.
  826         if self.fp32_residual_connection:

/opt/conda/envs/GLM4/lib/python3.10/site-packages/torch/nn/modules/module.py:1532 in _wrapped_call_impl
  1529         if self._compiled_call_impl is not None:
  1530             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
  1531         else:
❱ 1532             return self._call_impl(*args, **kwargs)
  1533
  1534     def _call_impl(self, *args, **kwargs):
  1535         forward_call = (self._slow_forward if torch._C._get_tracing_state() else self.fo

/opt/conda/envs/GLM4/lib/python3.10/site-packages/torch/nn/modules/module.py:1582 in _call_impl
  1579                 bw_hook = hooks.BackwardHook(self, full_backward_hooks, backward_pre_hoo
  1580                 args = bw_hook.setup_input_hook(args)
  1581
❱ 1582             result = forward_call(*args, **kwargs)
  1583             if _global_forward_hooks or self._forward_hooks:
  1584                 for hook_id, hook in (
  1585                     _global_forward_hooks.items(),

/opt/conda/envs/GLM4/lib/python3.10/site-packages/torch/nn/modules/sparse.py:163 in forward
  160                 self.weight[self.padding_idx].fill_(0)
  161
  162     def forward(self, input: Tensor) -> Tensor:
❱ 163         return F.embedding(
  164             input, self.weight, self.padding_idx, self.max_norm,
  165             self.norm_type, self.scale_grad_by_freq, self.sparse)
  166

/opt/conda/envs/GLM4/lib/python3.10/site-packages/torch/nn/functional.py:2264 in embedding
  2261         # torch.embedding_renorm_
  2262         # remove once script supports set_grad_enabled
  2263         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
❱ 2264     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
  2265
  2266
  2267 def embedding_bag(

rank0: RuntimeError: 'weight' must be 2-D

zRzRzRzRzRzRzR commented 2 months ago

See https://github.com/huggingface/transformers/issues/24643 and https://github.com/microsoft/DeepSpeed/issues/2746. Also, do not use DeepSpeed if you only have one GPU. The DeepSpeed setup in the official demo is for 8 GPUs.
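The advice above (no DeepSpeed on a single GPU) could be turned into a small pre-flight check before launching finetune.py. The following is only a sketch under stated assumptions: the helper names `visible_gpu_count`, `zero_stage`, and `preflight_warnings` are hypothetical and are not part of GLM-4 or DeepSpeed. The only real convention it relies on is that a DeepSpeed JSON config stores the ZeRO stage under `zero_optimization.stage`.

```python
import json
from typing import List, Optional


def visible_gpu_count(cuda_visible_devices: str) -> int:
    """Count the device IDs in a CUDA_VISIBLE_DEVICES-style string such as "1,3"."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def zero_stage(ds_config: dict) -> int:
    """Return the ZeRO stage from a DeepSpeed config dict (0 if not configured)."""
    return int(ds_config.get("zero_optimization", {}).get("stage", 0))


def preflight_warnings(cuda_visible_devices: str,
                       ds_config: Optional[dict]) -> List[str]:
    """Hypothetical check: warn when DeepSpeed is enabled but only one GPU is visible."""
    warnings = []
    gpus = visible_gpu_count(cuda_visible_devices)
    if ds_config is not None and gpus <= 1:
        warnings.append(
            f"DeepSpeed ZeRO stage {zero_stage(ds_config)} is enabled but only "
            f"{gpus} GPU is visible; the maintainer advises against this setup."
        )
    return warnings


if __name__ == "__main__":
    # Stand-in for the contents of configs/ds_zero_3.json (assumed, not copied).
    ds_cfg = json.loads('{"zero_optimization": {"stage": 3}}')
    for msg in preflight_warnings("3", ds_cfg):
        print("WARNING:", msg)
```

With `CUDA_VISIBLE_DEVICES=3` and a ZeRO-3 config, as in the original report, the check returns one warning. With eight visible devices it returns none.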

zzx528 commented 2 months ago


A single 80 GB A100 runs out of GPU memory.

Command: CUDA_VISIBLE_DEVICES=3 python finetune.py /workspace/zzx/GLM-4/finetune_demo/datasets /root/.cache/huggingface/big_models/glm-4-9b-chat configs/sft.yaml

Config used for this run, with DeepSpeed removed. It runs out of memory:

data_config:
  train_file: /workspace/zzx/GLM-4/finetune_demo/datasets/train.jsonl
  val_file: /workspace/zzx/GLM-4/finetune_demo/datasets/train.jsonl
  test_file: /workspace/zzx/GLM-4/finetune_demo/datasets/train.jsonl
  num_proc: 1
max_input_length: 256
max_output_length: 512
training_args:
  # see transformers.Seq2SeqTrainingArguments
  output_dir: /workspace/zzx/GLM-4/finetune_demo/outputs2
  max_steps: 1000 # 3000
  # needed to be fit for the dataset
  learning_rate: 5e-5
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 16
  evaluation_strategy: steps
  eval_steps: 500
  # settings for optimizer
  # adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  # deepspeed: configs/ds_zero_2.json

Error log:

/opt/conda/envs/GLM4/lib/python3.10/site-packages/torch/optim/adamw.py:609 in _multi_tensor_adamw
  606                 # Use the max. for normalizing running avg. of gradient
  607                 exp_avg_sq_sqrt = torch._foreach_sqrt(device_max_exp_avg_sqs)
  608             else:
❱ 609                 exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
  610
  611             torch._foreach_div_(exp_avg_sq_sqrt, bias_correction2_sqrt)
  612             torch._foreach_add_(exp_avg_sq_sqrt, eps)

OutOfMemoryError: CUDA out of memory. Tried to allocate 108.00 MiB. GPU
0%| | 0/1000 [00:04<?, ?it/s]

zzx528 commented 2 months ago


Two 80 GB A100s also run out of GPU memory.

Command: CUDA_VISIBLE_DEVICES=1,3 torchrun --nnodes=1 --nproc_per_node=2 finetune.py /workspace/zzx/GLM-4/finetune_demo/datasets /root/.cache/huggingface/big_models/glm-4-9b-chat configs/sft.yaml

Config:

data_config:
  train_file: /workspace/zzx/GLM-4/finetune_demo/datasets/train.jsonl
  val_file: /workspace/zzx/GLM-4/finetune_demo/datasets/train.jsonl
  test_file: /workspace/zzx/GLM-4/finetune_demo/datasets/train.jsonl
  num_proc: 1
max_input_length: 256
max_output_length: 512
training_args:
  # see transformers.Seq2SeqTrainingArguments
  output_dir: /workspace/zzx/GLM-4/finetune_demo/outputs2
  max_steps: 1000 # 3000
  # needed to be fit for the dataset
  learning_rate: 5e-5
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 16
  evaluation_strategy: steps
  eval_steps: 500
  # settings for optimizer
  # adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  deepspeed: configs/ds_zero_2.json

Run log (rank1):

/opt/conda/envs/GLM4/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1884 in step
  1881                 if partition_id == dist.get_world_size(group=self.real_dp_process_group
  1882                     single_grad_partition = self.flatten_dense_tensors_aligned(
  1883                         self.averaged_gradients[i],
❱ 1884                         int(self.partition_size[i])).to(self.single_partition_of_fp32_gr
  1885                 else:
  1886                     single_grad_partition = self.flatten(self.averaged_gradients[i]).to(
  1887                         self.single_partition_of_fp32_groups[i].dtype)

rank1: OutOfMemoryError: CUDA out of memory. Tried to allocate 17.51 GiB. GPU has a total capacity of 79.15 GiB of which 17.19 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 52.54 GiB is allocated by PyTorch, and 8.75 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management

0%| | 0/1000 [00:10<?, ?it/s]
E0702 09:26:27.811000 140162277397696 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 38950) of binary: /opt/conda/envs/GLM4/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/GLM4/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/GLM4/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/GLM4/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/opt/conda/envs/GLM4/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/opt/conda/envs/GLM4/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/GLM4/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

finetune.py FAILED

Failures:
[1]:
  time       : 2024-07-02_09:26:27
  host       : e5b62fd1cfab
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 38951)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2024-07-02_09:26:27
  host       : e5b62fd1cfab
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 38950)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

zRzRzRzRzRzRzR commented 2 months ago

Right, which is why the fine-tuning script was run on 8 A100s.
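A rough back-of-the-envelope estimate suggests why one or two 80 GB cards are not enough for full-parameter SFT while eight are. The sketch below assumes the usual mixed-precision Adam accounting from the ZeRO paper: 2 bytes per parameter for half-precision weights, 2 for gradients, and 12 for fp32 optimizer state. It also assumes about 9e9 parameters for glm-4-9b-chat. Both numbers are assumptions, not measurements from finetune.py, and activations, buffers, and fragmentation are ignored, so real usage is higher. The function name `per_gpu_state_gib` is hypothetical.

```python
GIB = 1024 ** 3


def per_gpu_state_gib(num_params: float, num_gpus: int, zero_stage: int) -> float:
    """Estimate per-GPU memory (GiB) for weights, gradients, and Adam state.

    ZeRO stage 1 shards the optimizer state, stage 2 also shards gradients,
    and stage 3 also shards the weights across the data-parallel GPUs.
    """
    weights = 2.0 * num_params   # half-precision weights
    grads = 2.0 * num_params     # half-precision gradients
    optim = 12.0 * num_params    # fp32 master weights + Adam momentum + variance
    if zero_stage >= 1:
        optim /= num_gpus
    if zero_stage >= 2:
        grads /= num_gpus
    if zero_stage >= 3:
        weights /= num_gpus
    return (weights + grads + optim) / GIB


if __name__ == "__main__":
    params = 9e9  # assumed parameter count
    for gpus, stage in [(1, 0), (2, 2), (8, 2), (8, 3)]:
        gib = per_gpu_state_gib(params, gpus, stage)
        print(f"{gpus} GPU(s), ZeRO stage {stage}: about {gib:.1f} GiB per GPU")
```

Under these assumptions a single GPU without sharding needs roughly 134 GiB for model state alone. Two GPUs with ZeRO-2 need about 75 GiB each, which leaves almost nothing for activations on a 79 GiB card. Eight GPUs with ZeRO-2 need about 31 GiB each.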

elesun2018 commented 1 week ago

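@zzx528 Did you solve this, and what was the cause? I ran into the same problem. When training with DeepSpeed, adding deepspeed: configs/ds_zero_3.json with offload_optimizer and offload_param produces the following error: RuntimeError: 'weight' must be 2-D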