QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0

[BUG] Full-parameter fine-tuning of qwen-14b-chat hangs #1270

Closed: PineappleWill closed this issue 1 month ago

PineappleWill commented 1 month ago

是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?

该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?

当前行为 | Current Behavior

WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

Setting ds_accelerator to cuda (auto detect)
(printed once by each of the 8 ranks)

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/opt/conda/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/conda did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/opt/conda/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')}
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...

(The BUG REPORT banner and the CUDA SETUP block above are printed once by each of the 8 ranks; some ranks report the runtime path as /usr/local/cuda/lib64/libcudart.so.11.0 instead of libcudart.so.)
[2024-05-30 16:20:02,704] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-30 16:20:02,705] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-05-30 16:20:02,705] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
(the WARNING and cdb=None lines repeat once per rank; the TorchBackend line appears once)
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
(the two messages above also repeat once per rank)

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]
Loading checkpoint shards:  12%|█▎        | 1/8 [00:01<00:07, 1.05s/it]
Loading checkpoint shards:  25%|██▌       | 2/8 [00:03<00:11, 1.83s/it]
Loading checkpoint shards:  38%|███▊      | 3/8 [00:05<00:09, 2.00s/it]
Loading checkpoint shards:  50%|█████     | 4/8 [00:07<00:07, 1.95s/it]
Loading checkpoint shards:  62%|██████▎   | 5/8 [00:09<00:05, 1.98s/it]
Loading checkpoint shards:  75%|███████▌  | 6/8 [00:11<00:03, 1.92s/it]
Loading checkpoint shards:  88%|████████▊ | 7/8 [00:13<00:01, 1.96s/it]
Loading checkpoint shards: 100%|██████████| 8/8 [00:13<00:00, 1.71s/it]
(progress bars from one rank shown; all 8 ranks finish loading their 8 shards in about 13 s)
Loading data... Formatting inputs...Skip in lazy mode
(this is the last line of output before the job hangs)

期望行为 | Expected Behavior

The training run should proceed normally.

复现方法 | Steps To Reproduce

bash finetune/finetune_ds.sh
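For context, below is a minimal sketch of rerunning the same reproduction command with distributed-debug logging enabled, which can help pin down where the hang occurs. The environment variables are standard PyTorch/NCCL debugging knobs, not options of finetune_ds.sh itself; treat this as an illustrative suggestion rather than part of the original report.

```bash
# Sketch: rerun the same fine-tuning launch with extra distributed diagnostics.
# NCCL_DEBUG / TORCH_DISTRIBUTED_DEBUG are standard env vars read by NCCL and
# torch.distributed; they are not defined by finetune_ds.sh.
export NCCL_DEBUG=INFO                 # log NCCL initialization and collectives
export NCCL_DEBUG_SUBSYS=INIT,GRAPH    # restrict NCCL output to setup phases
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra consistency checks in torch.distributed
export OMP_NUM_THREADS=1               # keep the launcher's conservative default

bash finetune/finetune_ds.sh 2>&1 | tee finetune_debug.log
```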

运行环境 | Environment

- OS:
- Python: 3.8.16
- Transformers: 4.32.0
- PyTorch: 2.0.1+cu117
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
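(If it helps, here is a small sketch of read-only commands that gather the fields requested above; these are standard tools, nothing specific to this repo.)

```bash
# Collect the environment details the issue template asks for.
head -n 2 /etc/os-release                                                # OS
python -V                                                                # Python
python -c "import transformers; print(transformers.__version__)"        # Transformers
python -c "import torch; print(torch.__version__, torch.version.cuda)"  # PyTorch / CUDA
nvidia-smi --query-gpu=name,driver_version --format=csv                 # GPU and driver
```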

备注 | Anything else?

Sometimes a submitted run works and sometimes it doesn't; it seems to come down to luck.

jklj077 commented 1 month ago

Hi, you could try upgrading to Qwen1.5 first and following the instructions there. Based on the logs, however, your environment appears to be broken: multiple system CUDA installations exist but seem poorly configured, and DeepSpeed complains that the NCCL backend is not implemented, which should not happen. Honestly, it is not our place to debug your environment, but I would suggest a clean install or using the provided Docker image.
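As a rough illustration of the kind of environment check suggested above (a sketch only; the Docker image name and tag are placeholders, see the repository README for the actual ones):

```bash
# Check which CUDA runtime the training processes actually resolve.
ls -d /usr/local/cuda*                  # how many CUDA toolkits are installed
ls /usr/local/cuda/lib64/libcudart.so*  # the runtime bitsandbytes picked up
python -c "import torch; print(torch.version.cuda, torch.cuda.nccl.version())"

# Or sidestep the host environment entirely with the official Docker image
# (image/tag below are placeholders; consult the repo README for the exact name).
docker run --gpus all -it --rm <qwen-image>:<tag> bash
```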