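I used the default ds_config.json and changed only the wandb setting to false (because it is slow). GPU memory was allocated, but training never started; it hung at "Using /root/.cache/torch_extensions as PyTorch extensions root...".
So I cleared /root/.cache and started training again, which produced the error output below: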
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py38_cu116/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] =/usr/local/cuda-11.6/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem =/usr/local/cuda-11.6/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -DCUDA_NO_HALF_OPERATORS -DCUDA_NO_HALF_CONVERSIONS -DCUDA_NO_BFLOAT16_CONVERSIONS -DCUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -std=c++14 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
=/usr/local/cuda-11.6/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem =/usr/local/cuda-11.6/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -DCUDA_NO_HALF_OPERATORS -DCUDA_NO_HALF_CONVERSIONS -DCUDA_NO_BFLOAT16_CONVERSIONS -DCUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -std=c++14 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
/bin/sh: 1: =/usr/local/cuda-11.6/bin/nvcc: not found
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
[2023-04-21 17:47:56,170] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.1, git-hash=unknown, git-branch=unknown
[2023-04-21 17:47:56,315] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem =/usr/local/cuda-11.6/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1808, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "train.py", line 143, in <module>
main()
File "train.py", line 109, in main
engine, _, _, _ = deepspeed.initialize(
File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 180, in initialize
engine = PipelineEngine(args=args,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 53, in __init__
super().__init__(*super_args, **super_kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 308, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1156, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1222, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
return self.jit_load(verbose)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
op_module = load(name=self.name,
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1202, in load
return _jit_compile(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1425, in _jit_compile
_write_ninja_file_and_build_library(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1537, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1824, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
Loading extension module fused_adam...
Traceback (most recent call last):
File "train.py", line 143, in <module>
main()
File "train.py", line 109, in main
engine, _, _, _ = deepspeed.initialize(
File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 180, in initialize
engine = PipelineEngine(args=args,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 53, in __init__
super().__init__(*super_args, **super_kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 308, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1156, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1222, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
return self.jit_load(verbose)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
op_module = load(name=self.name,
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1202, in load
return _jit_compile(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1450, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1844, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "<frozen importlib._bootstrap>", line 556, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1166, in create_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py38_cu116/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
Loading extension module fused_adam...
Traceback (most recent call last):
File "train.py", line 143, in <module>
main()
File "train.py", line 109, in main
engine, _, _, _ = deepspeed.initialize(
File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 180, in initialize
engine = PipelineEngine(args=args,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 53, in __init__
super().__init__(*super_args, **super_kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 308, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1156, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1222, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
return self.jit_load(verbose)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
op_module = load(name=self.name,
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1202, in load
return _jit_compile(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1450, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1844, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "<frozen importlib._bootstrap>", line 556, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1166, in create_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py38_cu116/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
Loading extension module fused_adam...
Traceback (most recent call last):
File "train.py", line 143, in <module>
main()
File "train.py", line 109, in main
engine, _, _, _ = deepspeed.initialize(
File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 180, in initialize
engine = PipelineEngine(args=args,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 53, in __init__
super().__init__(*super_args, **super_kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 308, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1156, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1222, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
return self.jit_load(verbose)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
op_module = load(name=self.name,
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1202, in load
return _jit_compile(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1450, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1844, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "<frozen importlib._bootstrap>", line 556, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1166, in create_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py38_cu116/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
[2023-04-21 17:48:12,493] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 105683
[2023-04-21 17:48:12,710] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 105684
[2023-04-21 17:48:12,710] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 105685
[2023-04-21 17:48:12,845] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 105686
[2023-04-21 17:48:12,847] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'train.py', '--local_rank=3', '--output_dir', '/root/nas-private/output', '--init_ckpt', '/root/nas-private/llama-7B-init-ckpt', '--data_path', './data/alpaca_data_sample_oneline_format.json', '--max_seq_len', '1024', '--train_steps', '1000', '--eval_steps', '10', '--save_steps', '200', '--log_steps', '1', '--pipe_parallel_size', '4', '--model_parallel_size', '1', '--use_flash_attn', 'true', '--deepspeed_config', './configs/ds_config.json'] exits with return code = 1
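
One thing that stands out in the log: the failing command is literally `=/usr/local/cuda-11.6/bin/nvcc`, and the shell reports it as "not found". My guess (an assumption, nothing in the log confirms where the value comes from) is that the CUDA_HOME or CUDA_PATH environment variable was set with a stray leading `=`, since PyTorch's extension builder derives the nvcc path from those variables. The other ranks then fail with the ImportError only because fused_adam.so was never built. Below is a minimal sketch for checking this; `diagnose_cuda_home` is a hypothetical helper written for this post, not part of PyTorch or DeepSpeed.

```python
import os
from typing import Optional


def diagnose_cuda_home(value: Optional[str]) -> str:
    """Classify a CUDA_HOME-style value (hypothetical helper, for illustration only)."""
    if not value:
        return "unset"
    # A leading '=' or surrounding whitespace usually comes from a typo such as
    # `export CUDA_HOME==/usr/local/cuda-11.6` (assumed cause, not confirmed by the log).
    if value != value.strip() or value.startswith("="):
        return "malformed"
    # The nvcc binary is expected under <CUDA_HOME>/bin/nvcc.
    if not os.path.isfile(os.path.join(value, "bin", "nvcc")):
        return "nvcc-missing"
    return "ok"


if __name__ == "__main__":
    for name in ("CUDA_HOME", "CUDA_PATH"):
        value = os.environ.get(name)
        print(f"{name}={value!r}: {diagnose_cuda_home(value)}")
```

If either variable is reported as malformed, correcting it in the shell profile or Dockerfile and clearing /root/.cache/torch_extensions again before relaunching should let the fused_adam build proceed, assuming the CUDA toolkit is actually installed at that path.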