OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
https://optimalscale.github.io/LMFlow/
Apache License 2.0

FAILED: custom_cuda_kernel.cuda.o ;nvcc fatal : Unsupported gpu architecture 'compute_89' ; subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.;AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam' #472

Status: Closed (huizhilei closed this issue 1 year ago)

huizhilei commented 1 year ago

Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/lmflow3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -c /root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
FAILED: custom_cuda_kernel.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/lmflow3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -c /root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
nvcc fatal : Unsupported gpu architecture 'compute_89'
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/lmflow3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -c /root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/LMFlow/examples/finetune.py", line 61, in <module>
    main()
  File "/root/LMFlow/examples/finetune.py", line 57, in main
    tuned_model = finetuner.tune(model=model, dataset=dataset)
  File "/root/LMFlow/src/lmflow/pipeline/finetuner.py", line 285, in tune
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/transformers/trainer.py", line 1639, in train
    return inner_training_loop(
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/transformers/trainer.py", line 1708, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1218, in _configure_basic_optimizer
    optimizer = DeepSpeedCPUAdam(model_parameters,
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
    self.ds_opt_adam = CPUAdamBuilder().load()
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'
Loading extension module cpu_adam...
Traceback (most recent call last):
  File "/root/LMFlow/examples/finetune.py", line 61, in <module>
    main()
  File "/root/LMFlow/examples/finetune.py", line 57, in main
    tuned_model = finetuner.tune(model=model, dataset=dataset)
  File "/root/LMFlow/src/lmflow/pipeline/finetuner.py", line 285, in tune
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/transformers/trainer.py", line 1639, in train
    return inner_training_loop(
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/transformers/trainer.py", line 1708, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1218, in _configure_basic_optimizer
    optimizer = DeepSpeedCPUAdam(model_parameters,
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
    self.ds_opt_adam = CPUAdamBuilder().load()
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "", line 565, in module_from_spec
  File "", line 1173, in create_module
  File "", line 228, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py39_cu117/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fce946893a0>
Traceback (most recent call last):
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fef501d23a0>
Traceback (most recent call last):
  File "/root/anaconda3/envs/lmflow3/lib/python3.9/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
[2023-06-06 07:59:27,351] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 21474
[2023-06-06 07:59:27,354] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 21475
[2023-06-06 07:59:27,354] [ERROR] [launch.py:434:sigkill_handler] ['/root/anaconda3/envs/lmflow3/bin/python', '-u', 'examples/finetune.py', '--local_rank=1', '--model_name_or_path', 'facebook/galactica-1.3b', '--dataset_path', '/root/LMFlow/data/alpaca/train', '--output_dir', '/root/LMFlow/output_models/finetune_with_lora', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '1e-4', '--block_size', '512', '--per_device_train_batch_size', '1', '--use_lora', '1', '--lora_r', '8', '--save_aggregated_lora', '0', '--deepspeed', 'configs/ds_config_zero2.json', '--bf16', '--run_name', 'finetune_with_lora', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = 1

shizhediao commented 1 year ago

Hi, please refer to this issue and see whether it solves the problem: https://github.com/OptimalScale/LMFlow/issues/446 Thanks!
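
For anyone hitting the same `nvcc fatal : Unsupported gpu architecture 'compute_89'` line: it means the `nvcc` used for the JIT build does not recognize the `compute_89` target requested for the GPU. nvcc gained `compute_89` support in CUDA 11.8, so an older toolkit would fail this way. Below is a minimal diagnostic sketch, not an official LMFlow or DeepSpeed tool, that compares the architectures nvcc reports against the GPU's compute capability. The helper names `parse_nvcc_archs` and `nvcc_supports` are hypothetical, and the script assumes that the `nvcc` found on `PATH` is the one the build uses (the log above uses `/usr/local/cuda/bin/nvcc`) and that it is recent enough to accept `--list-gpu-arch`.

```python
import re
import shutil
import subprocess


def parse_nvcc_archs(list_gpu_arch_output):
    """Return the numeric suffix of every `compute_XY` entry as a set of ints."""
    return {int(n) for n in re.findall(r"compute_(\d+)", list_gpu_arch_output)}


def nvcc_supports(list_gpu_arch_output, capability):
    """True if nvcc lists a virtual architecture matching (major, minor)."""
    major, minor = capability
    return major * 10 + minor in parse_nvcc_archs(list_gpu_arch_output)


def main():
    # Assumption: the nvcc on PATH is the same one the extension build invokes.
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        print("nvcc not found on PATH")
        return
    # `--list-gpu-arch` prints one `compute_XY` entry per supported architecture.
    result = subprocess.run([nvcc, "--list-gpu-arch"], capture_output=True, text=True)
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; cannot query the GPU")
        return
    if not torch.cuda.is_available():
        print("No CUDA device visible to PyTorch")
        return
    capability = torch.cuda.get_device_capability(0)  # e.g. (8, 9)
    ok = nvcc_supports(result.stdout, capability)
    print(f"GPU capability {capability}: supported by nvcc = {ok}")


if __name__ == "__main__":
    main()
```

If the script reports that the capability is not supported, the toolkit behind `nvcc` is probably older than the GPU requires, and installing a newer CUDA toolkit is the likely direction to investigate. The parsing is kept in pure functions so it can also be run on saved `nvcc --list-gpu-arch` output from another machine.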

zhengzzj commented 1 year ago

I'm facing the same issue. Have you solved this problem?