Closed: zzx528 closed this issue 2 months ago.
Please post the complete error output and your hardware configuration; I was unable to reproduce this.
accelerate 0.31.0 aiofiles 23.2.1 aiohttp 3.9.5 aiosignal 1.3.1 altair 5.3.0 annotated-types 0.7.0 anyio 4.4.0 asttokens 2.4.1 async-timeout 4.0.3 attrs 23.2.0 bitsandbytes 0.43.1 blinker 1.8.2 cachetools 5.3.3 certifi 2024.6.2 charset-normalizer 3.3.2 click 8.1.7 cloudpickle 3.0.0 cmake 3.29.5.1 comm 0.2.2 contourpy 1.2.1 cycler 0.12.1 dataclasses-json 0.6.7 datasets 2.20.0 debugpy 1.8.1 decorator 5.1.1 deepspeed 0.14.3 dill 0.3.8 diskcache 5.6.3 distro 1.9.0 dnspython 2.6.1 einops 0.8.0 email_validator 2.1.2 exceptiongroup 1.2.1 executing 2.0.1 fastapi 0.111.0 fastapi-cli 0.0.4 ffmpy 0.3.2 filelock 3.15.1 fonttools 4.53.0 frozenlist 1.4.1 fsspec 2024.5.0 gitdb 4.0.11 GitPython 3.1.43 gradio 4.36.1 gradio_client 1.0.1 greenlet 3.0.3 h11 0.14.0 hjson 3.1.0 httpcore 1.0.5 httptools 0.6.1 httpx 0.27.0 huggingface-hub 0.23.4 idna 3.7 importlib_resources 6.4.0 interegular 0.3.3 ipykernel 6.29.4 ipython 8.25.0 jedi 0.19.1 jieba 0.42.1 Jinja2 3.1.4 joblib 1.4.2 jsonpatch 1.33 jsonpointer 3.0.0 jsonschema 4.22.0 jsonschema-specifications 2023.12.1 jupyter_client 8.6.2 jupyter_core 5.7.2 kiwisolver 1.4.5 langchain 0.2.5 langchain-community 0.2.5 langchain-core 0.2.8 langchain-text-splitters 0.2.1 langsmith 0.1.78 lark 1.1.9 llvmlite 0.43.0 lm-format-enforcer 0.10.1 lxml 5.2.2 markdown-it-py 3.0.0 MarkupSafe 2.1.5 marshmallow 3.21.3 matplotlib 3.9.0 matplotlib-inline 0.1.7 mdurl 0.1.2 mpi4py 3.1.4 mpmath 1.3.0 msgpack 1.0.8 multidict 6.0.5 multiprocess 0.70.16 mypy-extensions 1.0.0 nest-asyncio 1.6.0 networkx 3.3 ninja 1.11.1.1 nltk 3.8.1 numba 0.60.0 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.555.43 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.5.40 nvidia-nvtx-cu12 12.1.105 openai 1.34.0 orjson 3.10.5 outlines 0.0.45 packaging 24.1 pandas 2.2.2 parso 0.8.4 peft 0.11.1 pexpect 4.9.0 pillow 10.3.0 pip 24.0 platformdirs 4.2.2 prometheus_client 0.20.0 prometheus-fastapi-instrumentator 7.0.0 prompt_toolkit 3.0.47 protobuf 4.25.3 psutil 5.9.8 ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyairports 2.1.1 pyarrow 16.1.0 pyarrow-hotfix 0.6 pycountry 24.6.1 pydantic 2.7.4 pydantic_core 2.18.4 pydeck 0.9.1 pydub 0.25.1 Pygments 2.18.0 PyJWT 2.8.0 PyMuPDF 1.24.5 PyMuPDFb 1.24.3 pyparsing 3.1.2 python-dateutil 2.9.0.post0 python-docx 1.1.2 python-dotenv 1.0.1 python-multipart 0.0.9 python-pptx 0.6.23 pytz 2024.1 PyYAML 6.0.1 pyzmq 26.0.3 ray 2.24.0 referencing 0.35.1 regex 2024.5.15 requests 2.32.3 rich 13.7.1 rouge-chinese 1.0.3 rpds-py 0.18.1 ruamel.yaml 0.18.6 ruamel.yaml.clib 0.2.8 ruff 0.4.9 safetensors 0.4.3 scikit-learn 1.5.0 scipy 1.13.1 semantic-version 2.10.0 sentence-transformers 3.0.1 sentencepiece 0.2.0 setuptools 69.5.1 shellingham 1.5.4 six 1.16.0 smmap 5.0.1 sniffio 1.3.1 SQLAlchemy 2.0.30 sse-starlette 2.1.2 stack-data 0.6.3 starlette 0.37.2 streamlit 1.35.0 sympy 1.12.1 tenacity 8.4.1 threadpoolctl 3.5.0 tiktoken 0.7.0 timm 1.0.3 tokenizers 0.19.1 toml 0.10.2 tomlkit 0.12.0 toolz 0.12.1 torch 2.3.0 torchvision 0.18.1 tornado 6.4.1 tqdm 4.66.4 traitlets 5.14.3 transformers 4.40.0 triton 2.3.0 typer 0.12.3 typing_extensions 4.12.2 typing-inspect 0.9.0 tzdata 2024.1 ujson 5.10.0 urllib3 2.2.1 uvicorn 0.30.1 uvloop 0.19.0 vllm 0.5.0.post1 vllm-flash-attn 2.5.9 watchdog 4.0.1 watchfiles 0.22.0 wcwidth 0.2.13 websockets 11.0.3 wheel 
0.43.0 xformers 0.0.26.post1 XlsxWriter 3.2.0 xxhash 3.4.1 yarl 1.9.4 zhipuai 2.1.0.20240521
sft.yaml:

```yaml
data_config:
  train_file: /workspace/zzx/GLM-4/finetune_demo/datasets/train.jsonl
  val_file: /workspace/zzx/GLM-4/finetune_demo/datasets/train.jsonl
  test_file: /workspace/zzx/GLM-4/finetune_demo/datasets/train.jsonl
  num_proc: 1
max_input_length: 512
max_output_length: 512
training_args:
  # see `transformers.Seq2SeqTrainingArguments`
  output_dir: ./output
  max_steps: 3000
  learning_rate: 5e-5
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  save_strategy: steps
  save_steps: 500
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  per_device_eval_batch_size: 16
  evaluation_strategy: steps
  eval_steps: 500
  predict_with_generate: true
  generation_config:
    max_new_tokens: 512
  deepspeed: configs/ds_zero_3.json
```
Run command: CUDA_VISIBLE_DEVICES=3 python finetune.py /workspace/zzx/GLM-4/finetune_demo/datasets /root/.cache/huggingface/big_models/glm-4-9b-chat configs/sft.yaml
Partial error log:

```
[rank0] File "/opt/conda/envs/GLM4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
[rank0] File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat/modeling_chatglm.py", line 823, in forward
    words_embeddings = self.word_embeddings(input_ids)
[rank0] File "/opt/conda/envs/GLM4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
[rank0] File "/opt/conda/envs/GLM4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
    result = forward_call(*args, **kwargs)
[rank0] File "/opt/conda/envs/GLM4/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 163, in forward
    return F.embedding(
        input, self.weight, self.padding_idx, self.max_norm,
        self.norm_type, self.scale_grad_by_freq, self.sparse)
[rank0] File "/opt/conda/envs/GLM4/lib/python3.10/site-packages/torch/nn/functional.py", line 2264, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank0] RuntimeError: 'weight' must be 2-D
```
https://github.com/huggingface/transformers/issues/24643 https://github.com/microsoft/DeepSpeed/issues/2746 Also: with only one GPU, don't use DeepSpeed; the official demo's DeepSpeed setup assumes 8 GPUs.
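Both linked issues describe this failure mode: under ZeRO stage 3, DeepSpeed replaces every parameter with a flattened 1-D shard, so if the model is materialized before the DeepSpeed config is registered, `F.embedding` sees a 1-D placeholder instead of the full `[vocab, hidden]` matrix and raises `'weight' must be 2-D`. A minimal sketch of the load order the stock transformers ZeRO-3 integration expects (illustrative only, not the exact finetune.py code):

```python
# Sketch, assuming the stock transformers + DeepSpeed ZeRO-3 integration.
from transformers import AutoModelForCausalLM, Seq2SeqTrainingArguments

# Build TrainingArguments (which registers the DeepSpeed config) BEFORE
# loading the model, so from_pretrained can shard parameters via
# deepspeed.zero.Init instead of leaving 1-D placeholder weights behind.
training_args = Seq2SeqTrainingArguments(
    output_dir="./output",
    deepspeed="configs/ds_zero_3.json",
)
model = AutoModelForCausalLM.from_pretrained(
    "/root/.cache/huggingface/big_models/glm-4-9b-chat",
    trust_remote_code=True,
)
```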
A single 80 GB A100 runs out of memory.
Run command: CUDA_VISIBLE_DEVICES=3 python finetune.py /workspace/zzx/GLM-4/finetune_demo/datasets /root/.cache/huggingface/big_models/glm-4-9b-chat configs/sft.yaml
Config with DeepSpeed removed; it still runs out of memory:

```yaml
data_config:
  train_file: /workspace/zzx/GLM-4/finetune_demo/datasets/train.jsonl
  val_file: /workspace/zzx/GLM-4/finetune_demo/datasets/train.jsonl
  test_file: /workspace/zzx/GLM-4/finetune_demo/datasets/train.jsonl
  num_proc: 1
max_input_length: 256
max_output_length: 512
training_args:
  # see `transformers.Seq2SeqTrainingArguments`
  output_dir: /workspace/zzx/GLM-4/finetune_demo/outputs2
  max_steps: 1000  # 3000
  learning_rate: 5e-5
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  save_strategy: steps
  save_steps: 500
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  per_device_eval_batch_size: 16
  evaluation_strategy: steps
  eval_steps: 500
  predict_with_generate: true
  generation_config:
    max_new_tokens: 512
```
Error log:

```
File "/opt/conda/envs/GLM4/lib/python3.10/site-packages/torch/optim/adamw.py", line 609, in _multi_tensor_adamw
    exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
OutOfMemoryError: CUDA out of memory. Tried to allocate 108.00 MiB. GPU
  0%|          | 0/1000 [00:04<?, ?it/s]
```
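The OOM without DeepSpeed is expected: full fine-tuning with mixed-precision AdamW keeps bf16 weights and gradients plus an fp32 master copy and two fp32 moment buffers, roughly 16 bytes per parameter before activations. A back-of-the-envelope check (parameter count approximate):

```python
# Rough memory estimate for fully fine-tuning GLM-4-9B with AdamW.
# Assumes the usual mixed-precision layout: bf16 weights + bf16 grads
# + fp32 master weights + fp32 exp_avg + fp32 exp_avg_sq; activations extra.
params = 9.4e9                       # ~9.4B parameters (approximate)
bytes_per_param = 2 + 2 + 4 + 4 + 4  # = 16 bytes
print(f"~{params * bytes_per_param / 2**30:.0f} GiB")  # ~140 GiB >> 80 GiB
```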
huggingface/transformers#24643 microsoft/DeepSpeed#2746 Also: with only one GPU, don't use DeepSpeed; the official demo's DeepSpeed setup assumes 8 GPUs.
Two 80 GB A100s also run out of memory.
Run command: CUDA_VISIBLE_DEVICES=1,3 torchrun --nnodes=1 --nproc_per_node=2 finetune.py /workspace/zzx/GLM-4/finetune_demo/datasets /root/.cache/huggingface/big_models/glm-4-9b-chat configs/sft.yaml
Config:

```yaml
data_config:
  train_file: /workspace/zzx/GLM-4/finetune_demo/datasets/train.jsonl
  val_file: /workspace/zzx/GLM-4/finetune_demo/datasets/train.jsonl
  test_file: /workspace/zzx/GLM-4/finetune_demo/datasets/train.jsonl
  num_proc: 1
max_input_length: 256
max_output_length: 512
training_args:
  # see `transformers.Seq2SeqTrainingArguments`
  output_dir: /workspace/zzx/GLM-4/finetune_demo/outputs2
  max_steps: 1000  # 3000
  learning_rate: 5e-5
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  save_strategy: steps
  save_steps: 500
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  per_device_eval_batch_size: 16
  evaluation_strategy: steps
  eval_steps: 500
  predict_with_generate: true
  generation_config:
    max_new_tokens: 512
  deepspeed: configs/ds_zero_2.json
```
Run log:

```
[rank1] File "/opt/conda/envs/GLM4/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1884, in step
    single_grad_partition = self.flatten_dense_tensors_aligned(
        self.averaged_gradients[i],
        int(self.partition_size[i])).to(self.single_partition_of_fp32_groups[i].dtype)
[rank1] OutOfMemoryError: CUDA out of memory. Tried to allocate 17.51 GiB. GPU has a total capacity of 79.15 GiB of which
17.19 GiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 52.54 GiB
is allocated by PyTorch, and 8.75 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large,
try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management.
```
Right, that's exactly why the fine-tuning script uses 8x A100s...
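The two-GPU numbers line up with a rough estimate, too: ZeRO-2 shards gradients and optimizer state across the two GPUs, but each GPU still holds the full bf16 weights, so per GPU that is roughly 17.5 GiB weights + 8.75 GiB gradient shard + 52.5 GiB fp32 optimizer shard, already at the 79 GiB limit; the failing 17.51 GiB allocation above is the fp32 copy of a gradient partition. If more GPUs aren't available, optimizer CPU offload can move that pressure to host RAM. A hedged sketch follows (field names are standard DeepSpeed options, but the repo's actual configs/ds_zero_2.json may differ):

```python
# Hedged sketch: generate a ZeRO-2 config with optimizer CPU offload.
# All field names are standard DeepSpeed options; "ds_zero_2_offload.json"
# is a hypothetical filename, not a file shipped with the repo.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        # Move fp32 master weights and Adam moments to host RAM.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

with open("configs/ds_zero_2_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```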
@zzx528 Did you solve this? What was the cause? I ran into the same problem: training with DeepSpeed (deepspeed: configs/ds_zero_3.json, with offload_optimizer and offload_param) fails with RuntimeError: 'weight' must be 2-D.
System Info / 系統信息
accelerate 0.31.0 aiofiles 23.2.1 aiohttp 3.9.5 aiosignal 1.3.1 altair 5.3.0 annotated-types 0.7.0 anyio 4.4.0 async-timeout 4.0.3 attrs 23.2.0 bitsandbytes 0.43.1 certifi 2024.6.2 charset-normalizer 3.3.2 click 8.1.7 contourpy 1.2.1 cycler 0.12.1 datasets 2.20.0 deepspeed 0.14.4 dill 0.3.8 distro 1.9.0 dnspython 2.6.1 einops 0.8.0 email_validator 2.2.0 exceptiongroup 1.2.1 fastapi 0.111.0 fastapi-cli 0.0.4 ffmpy 0.3.2 filelock 3.15.4 fonttools 4.53.0 frozenlist 1.4.1 fsspec 2024.5.0 gradio 4.37.2 gradio_client 1.0.2 h11 0.14.0 hjson 3.1.0 httpcore 1.0.5 httptools 0.6.1 httpx 0.27.0 huggingface-hub 0.23.4 idna 3.7 importlib_resources 6.4.0 jieba 0.42.1 Jinja2 3.1.4 joblib 1.4.2 jsonschema 4.22.0 jsonschema-specifications 2023.12.1 kiwisolver 1.4.5 markdown-it-py 3.0.0 MarkupSafe 2.1.5 matplotlib 3.9.0 mdurl 0.1.2 mpi4py 3.1.4 mpmath 1.3.0 multidict 6.0.5 multiprocess 0.70.16 networkx 3.3 ninja 1.11.1.1 nltk 3.8.1 numpy 2.0.0 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.555.43 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.5.40 nvidia-nvtx-cu12 12.1.105 openai 1.35.7 orjson 3.10.5 packaging 24.1 pandas 2.2.2 peft 0.11.1 pillow 10.3.0 pip 24.0 psutil 6.0.0 py-cpuinfo 9.0.0 pyarrow 16.1.0 pyarrow-hotfix 0.6 pydantic 2.7.4 pydantic_core 2.18.4 pydub 0.25.1 Pygments 2.18.0 pyparsing 3.1.2 python-dateutil 2.9.0.post0 python-dotenv 1.0.1 python-multipart 0.0.9 pytz 2024.1 PyYAML 6.0.1 referencing 0.35.1 regex 2024.5.15 requests 2.32.3 rich 13.7.1 rouge-chinese 1.0.3 rpds-py 0.18.1 ruamel.yaml 0.18.6 ruamel.yaml.clib 0.2.8 ruff 0.5.0 safetensors 0.4.3 scikit-learn 1.5.0 scipy 1.14.0 semantic-version 2.10.0 sentence-transformers 3.0.1 sentencepiece 0.2.0 setuptools 69.5.1 shellingham 1.5.4 six 1.16.0 sniffio 1.3.1 sse-starlette 2.1.2 starlette 0.37.2 sympy 1.12.1 threadpoolctl 3.5.0 tiktoken 0.7.0 timm 1.0.7 tokenizers 0.19.1 tomlkit 0.12.0 toolz 0.12.1 torch 2.3.1 torchvision 0.18.1 tqdm 4.66.4 transformers 4.40.0 triton 2.3.1 typer 0.12.3 typing_extensions 4.12.2 tzdata 2024.1 ujson 5.10.0 urllib3 2.2.2 uvicorn 0.30.1 uvloop 0.19.0 watchfiles 0.22.0 websockets 11.0.3 wheel 0.43.0 xxhash 3.4.1 yarl 1.9.4
Who can help? / 谁可以帮助到您?
No response
Information / 问题信息
Reproduction / 复现过程
CUDA_VISIBLE_DEVICES=0 python finetune.py /workspace/zzx/GLM-4/finetune_demo/datasets /root/.cache/huggingface/big_models/glm-4-9b-chat configs/sft.yaml
Expected behavior / 期待表现
The fine-tuning run completes successfully.