Closed by PineappleWill 1 month ago
Hi, you could try upgrading to Qwen1.5 first and following the instructions there. That said, judging from the logs, your environment appears to be broken: multiple system CUDA installations exist but seem poorly configured, and DeepSpeed reports that the NCCL backend is not implemented, which should not happen. Debugging your environment is beyond what we can do here, but I would suggest a clean install or using the provided Docker image.
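One rough way to see the CUDA inconsistency mentioned above is to count which runtime library each process reported. The sketch below is only an illustration: the helper name `count_cuda_runtime_paths` and the `sample_log` string are made up for this example, and the regular expression assumes the `CUDA runtime path found:` wording that bitsandbytes prints in the logs of this issue. Different file names in the same directory may be symlinks to the same library, so treat the result as a hint rather than proof.

```python
import re
from collections import Counter

# Matches the line bitsandbytes prints when it locates a CUDA runtime library
# (wording taken from the logs in this issue).
RUNTIME_RE = re.compile(r"CUDA runtime path found: (\S+)")


def count_cuda_runtime_paths(log_text: str) -> Counter:
    """Count how often each reported CUDA runtime path appears in the log text."""
    return Counter(RUNTIME_RE.findall(log_text))


# Hypothetical excerpt shaped like the interleaved output of two ranks.
sample_log = (
    "CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so "
    "CUDA SETUP: Detected CUDA version 117 "
    "CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0 "
    "CUDA SETUP: Detected CUDA version 117 "
)

paths = count_cuda_runtime_paths(sample_log)
for path, n in sorted(paths.items()):
    print(f"{n}x {path}")
```

If more than one distinct path shows up across ranks, that is a reasonable cue to check `LD_LIBRARY_PATH` and the CUDA installations on the machine before retrying.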
Is there an existing issue / discussion for this?
Is there an existing answer for this in FAQ?
Current Behavior
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Setting ds_accelerator to cuda (auto detect)
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
================================================================================ bin /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so /opt/conda/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/conda did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths... warn(msg) /opt/conda/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')} warn(msg) CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so CUDA SETUP: Highest compute capability among GPUs detected: 8.0 CUDA SETUP: Detected CUDA version 117 CUDA SETUP: Loading binary /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so... bin /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so /opt/conda/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/conda did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths... warn(msg) /opt/conda/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')} warn(msg) CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0 CUDA SETUP: Highest compute capability among GPUs detected: 8.0 CUDA SETUP: Detected CUDA version 117 CUDA SETUP: Loading binary /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so... 
bin /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so /opt/conda/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/conda did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths... warn(msg) /opt/conda/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')} warn(msg) CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so CUDA SETUP: Highest compute capability among GPUs detected: 8.0 CUDA SETUP: Detected CUDA version 117 CUDA SETUP: Loading binary /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so... bin /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so /opt/conda/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/conda did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths... warn(msg) /opt/conda/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')} warn(msg) CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so CUDA SETUP: Highest compute capability among GPUs detected: 8.0 CUDA SETUP: Detected CUDA version 117 CUDA SETUP: Loading binary /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so... bin /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so /opt/conda/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/conda did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths... 
warn(msg) /opt/conda/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')} warn(msg) CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0 CUDA SETUP: Highest compute capability among GPUs detected: 8.0 CUDA SETUP: Detected CUDA version 117 CUDA SETUP: Loading binary /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so... bin /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so /opt/conda/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/conda did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths... warn(msg) /opt/conda/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')} warn(msg) CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so CUDA SETUP: Highest compute capability among GPUs detected: 8.0 CUDA SETUP: Detected CUDA version 117 CUDA SETUP: Loading binary /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so... bin /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so /opt/conda/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/conda did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths... 
warn(msg) /opt/conda/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')} warn(msg) CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so CUDA SETUP: Highest compute capability among GPUs detected: 8.0 CUDA SETUP: Detected CUDA version 117 CUDA SETUP: Loading binary /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so... bin /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so /opt/conda/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /opt/conda did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths... warn(msg) /opt/conda/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')} warn(msg) CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so CUDA SETUP: Highest compute capability among GPUs detected: 8.0 CUDA SETUP: Detected CUDA version 117 CUDA SETUP: Loading binary /opt/conda/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so... 
[2024-05-30 16:20:02,704] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2024-05-30 16:20:02,704] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2024-05-30 16:20:02,704] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2024-05-30 16:20:02,704] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2024-05-30 16:20:02,704] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2024-05-30 16:20:02,705] [INFO] [comm.py:594:init_distributed] cdb=None [2024-05-30 16:20:02,705] [INFO] [comm.py:594:init_distributed] cdb=None [2024-05-30 16:20:02,705] [INFO] [comm.py:594:init_distributed] cdb=None [2024-05-30 16:20:02,705] [INFO] [comm.py:594:init_distributed] cdb=None [2024-05-30 16:20:02,705] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl [2024-05-30 16:20:02,705] [INFO] [comm.py:594:init_distributed] cdb=None [2024-05-30 16:20:02,705] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2024-05-30 16:20:02,705] [INFO] [comm.py:594:init_distributed] cdb=None [2024-05-30 16:20:02,705] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2024-05-30 16:20:02,705] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2024-05-30 16:20:02,705] [INFO] [comm.py:594:init_distributed] cdb=None [2024-05-30 16:20:02,705] [INFO] [comm.py:594:init_distributed] cdb=None The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained". The model is automatically converting to bf16 for faster inference. 
If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained". The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained". The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained". Try importing flash-attention for faster inference... The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained". Try importing flash-attention for faster inference... Try importing flash-attention for faster inference... Try importing flash-attention for faster inference... The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained". The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained". Try importing flash-attention for faster inference... Try importing flash-attention for faster inference... Try importing flash-attention for faster inference... The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained". Try importing flash-attention for faster inference...
Loading checkpoint shards: 100%|██████████| 8/8 [00:13<00:00, 1.71s/it]
Loading data...
Formatting inputs...Skip in lazy mode
Expected Behavior
The script runs normally.
Steps To Reproduce
bash finetune/finetune_ds.sh
Environment
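No environment details were given here. A minimal sketch for collecting them is shown below, assuming PyTorch is the framework in use (as the logs suggest). The function name `collect_env_report` and the choice of fields are illustrative, not part of any Qwen or DeepSpeed tooling. The script degrades gracefully if `torch` is missing or fails to import.

```python
import importlib.util
import os
import platform


def collect_env_report() -> dict:
    """Gather basic version and CUDA/NCCL facts as strings for a bug report."""
    report = {
        "python": platform.python_version(),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH", ""),
        "CUDA_HOME": os.environ.get("CUDA_HOME", ""),
    }
    if importlib.util.find_spec("torch") is None:
        report["torch"] = "not installed"
        return report
    try:
        import torch
        import torch.distributed as dist
    except Exception as exc:  # a broken install can fail at import time
        report["torch"] = f"import failed: {exc}"
        return report
    report["torch"] = torch.__version__
    # CUDA version PyTorch was built against (None for CPU-only builds).
    report["torch_cuda"] = str(torch.version.cuda)
    report["cuda_available"] = str(torch.cuda.is_available())
    # is_nccl_available() only exists when the distributed package is built in.
    report["nccl_available"] = str(dist.is_available() and dist.is_nccl_available())
    return report


report = collect_env_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Pasting this output alongside the DeepSpeed and bitsandbytes versions would make it easier to tell whether the CUDA version PyTorch was built against matches the runtime found on the system.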
Anything else?
Sometimes the job runs after submission and sometimes it does not; it seems to come down to luck.