Closed alexhmyang closed 1 year ago
Could you please provide more of the log? I think there should be another error before this one.
Hi, I also get the same error. The log is as follows:
(lmflow) xuyan@black-rack-0:~/LLM/LMFlow$ CUDA_VISIBLE_DEVICES=0 ./scripts/run_finetune.sh "--num_gpus=1 --master_port 10001"
[2023-04-03 14:59:52,961] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
[2023-04-03 14:59:55,358] [INFO] [runner.py:550:main] cmd = /home/xuyan/anaconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=10001 --enable_each_rank_log=None examples/finetune.py --model_name_or_path gpt2 --dataset_path /home/xuyan/LLM/LMFlow/data/alpaca/train --output_dir /home/xuyan/LLM/LMFlow/output_models/finetune --overwrite_output_dir --num_train_epochs 0.01 --learning_rate 2e-5 --block_size 512 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --bf16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2023-04-03 14:59:57,679] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-03 14:59:57,680] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-03 14:59:57,680] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-03 14:59:57,680] [INFO] [launch.py:162:main] dist_world_size=1
[2023-04-03 14:59:57,680] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-03 15:00:05,633] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
04/03/2023 15:00:06 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
04/03/2023 15:00:07 - WARNING - datasets.builder - Found cached dataset json (/home/xuyan/.cache/huggingface/datasets/json/default-dda63bbab21e635e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
[2023-04-03 15:00:14,782] [INFO] [partition_parameters.py:415:__exit__] finished initializing model with 0.16B parameters
/home/xuyan/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:2547: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
warnings.warn(
04/03/2023 15:00:15 - WARNING - datasets.fingerprint - Parameter 'function'=<function HFDecoderModel.tokenize.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/xuyan/LLM/LMFlow/examples/finetune.py", line 70, in
It's better to use the same CUDA version as PyTorch, like this:
conda install cuda -c nvidia/label/cuda-11.7.0
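For reference, a minimal sketch (an addition for illustration, assuming only that PyTorch is installed and nvcc is on PATH) to verify that the two versions actually match:

```python
# Check that the system nvcc matches the CUDA version PyTorch was built
# against; DeepSpeed JIT-compiles its ops with nvcc, so a mismatch here
# is the usual cause of the cpu_adam build failure in this thread.
import subprocess
import torch

print("PyTorch built against CUDA:", torch.version.cuda)  # e.g. '11.7'
result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
print(result.stdout)  # the 'release' line should match the version above
```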
(lmflow) u20@u20:~/LMFlow/service$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
Does CUDA 11.6 not work?
I've found it's always hard to debug CUDA-version-related issues...
It works fine on my machine after using conda to install the 11.7 version of CUDA.
Yes, you are right. I also found that it is a CUDA-related issue. It seems that CUDA 11.0 is too old to run DeepSpeed, but CUDA 11.6 should be fine, I think.
Thank you very much for your help!
...
nvcc fatal : Unsupported gpu architecture 'compute_86'
...
According to the log, it is indeed a CUDA version problem. It seems nvcc is not compatible with your GPU. You may try another version of CUDA. Thanks 😄
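As a quick check (an illustrative addition, not from the thread), the GPU's compute capability can be read through PyTorch; compute_86 is Ampere, which nvcc only supports from CUDA 11.1 onward, so an older toolkit fails with exactly this message:

```python
# Print the compute capability of GPU 0 so it can be compared against
# the architectures the installed nvcc supports.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute_{major}{minor}")  # e.g. 'compute_86' for an RTX 30-series card
```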
Yes, I had the same error. After installing CUDA with conda install cuda -c nvidia/label/cuda-11.7.0, it seems OK now.
It's better to use the same CUDA version as PyTorch, like this:
conda install cuda -c nvidia/label/cuda-11.7.0
I am using module load gcc/9.2.0 cuda/11.7, but I am still getting the error:
ImportError: /home/xxxxxx/.cache/torch_extensions/py39_cu117/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
It's better to use the same CUDA version as PyTorch, like this:
conda install cuda -c nvidia/label/cuda-11.7.0
This solution works for me! Thank you very much for the help! <3
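If the ImportError above persists after installing the matching CUDA, one plausible extra step (an assumption on my part, using the default cache path shown in the error message) is to delete the stale build cache so DeepSpeed recompiles cpu_adam against the new toolchain:

```python
# Remove the torch_extensions JIT-build cache; a half-finished build from
# the old CUDA toolkit leaves a missing/broken cpu_adam.so behind, and
# DeepSpeed will rebuild the extension on the next run.
import shutil
from pathlib import Path

cache = Path.home() / ".cache" / "torch_extensions"
if cache.exists():
    shutil.rmtree(cache)
```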
This issue has been marked as stale because it has not had recent activity. If you think this still needs to be addressed please feel free to reopen this issue. Thanks!
RuntimeError: Error building extension 'cpu_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f683231b670>
Traceback (most recent call last):
  File "/home/u20/miniconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 110, in __del__
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
[2023-04-03 12:50:15,113] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 21626
[2023-04-03 12:50:15,113] [ERROR] [launch.py:324:sigkill_handler] ['/home/u20/miniconda3/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'gpt2', '--dataset_path', '/home/u20/LMFlow/data/alpaca/train', '--output_dir', '/home/u20/LMFlow/output_models/finetune', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--block_size', '512', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--bf16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = 1
I get this error when running ./scripts/run_finetune.sh. I have a GPU and CUDA installed, so why does it raise a CPU error?
./scripts/run_finetune_with_lora.sh also raises the same error.
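For context, here is a minimal sketch (an illustrative addition) that reproduces the failing step in isolation: cpu_adam is JIT-compiled with nvcc the first time a DeepSpeedCPUAdam optimizer is constructed, so a broken or mismatched CUDA toolchain fails at this point even though the GPU itself is fine. The "CPU" in the name refers to where the ZeRO-offloaded optimizer states live, not to CPU-only training.

```python
# Constructing DeepSpeedCPUAdam triggers the nvcc build of the cpu_adam
# extension (if it is not already cached), surfacing the real compiler
# error without running the whole finetuning script.
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

params = [torch.nn.Parameter(torch.zeros(4))]
opt = DeepSpeedCPUAdam(params)  # fails here if the CUDA toolchain is broken
print("cpu_adam built successfully")
```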