HuangLK / transpeeder

train llama on a single A100 80G node using 🤗 transformers and 🚀 Deepspeed Pipeline Parallelism
Apache License 2.0

Training 7B LLaMA on four GPUs: error when retraining after clearing the cache #20

Closed Ulov888 closed 1 year ago

Ulov888 commented 1 year ago

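I used the default ds_config.json configuration file and only changed the wandb setting to false (because it was slow). GPU memory was then allocated, but training never started (it hung at "Using /root/.cache/torch_extensions as PyTorch extensions root..."). So I cleared /root/.cache and started training again, which produced the error below: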

Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py38_cu116/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] =/usr/local/cuda-11.6/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem =/usr/local/cuda-11.6/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -std=c++14 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
=/usr/local/cuda-11.6/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem =/usr/local/cuda-11.6/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -std=c++14 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
/bin/sh: 1: =/usr/local/cuda-11.6/bin/nvcc: not found
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
[2023-04-21 17:47:56,170] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.1, git-hash=unknown, git-branch=unknown
[2023-04-21 17:47:56,315] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem =/usr/local/cuda-11.6/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1808, in _run_ninja_build
    subprocess.run(
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 143, in <module>
    main()
  File "train.py", line 109, in main
    engine, _, _, _ = deepspeed.initialize(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 180, in initialize
    engine = PipelineEngine(args=args,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 53, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1156, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1222, in _configure_basic_optimizer
    optimizer = FusedAdam(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1202, in load
    return _jit_compile(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1425, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1537, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1824, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_adam'
Loading extension module fused_adam...
Traceback (most recent call last):
  File "train.py", line 143, in <module>
    main()
  File "train.py", line 109, in main
    engine, _, _, _ = deepspeed.initialize(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 180, in initialize
    engine = PipelineEngine(args=args,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 53, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1156, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1222, in _configure_basic_optimizer
    optimizer = FusedAdam(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1202, in load
    return _jit_compile(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1450, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1844, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1166, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py38_cu116/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
Loading extension module fused_adam...
Traceback (most recent call last):
  File "train.py", line 143, in <module>
    main()
  File "train.py", line 109, in main
    engine, _, _, _ = deepspeed.initialize(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 180, in initialize
    engine = PipelineEngine(args=args,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 53, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1156, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1222, in _configure_basic_optimizer
    optimizer = FusedAdam(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1202, in load
    return _jit_compile(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1450, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1844, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1166, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py38_cu116/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
Loading extension module fused_adam...
Traceback (most recent call last):
  File "train.py", line 143, in <module>
    main()
  File "train.py", line 109, in main
    engine, _, _, _ = deepspeed.initialize(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 180, in initialize
    engine = PipelineEngine(args=args,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 53, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1156, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1222, in _configure_basic_optimizer
    optimizer = FusedAdam(
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1202, in load
    return _jit_compile(
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1450, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1844, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1166, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py38_cu116/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
[2023-04-21 17:48:12,493] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 105683
[2023-04-21 17:48:12,710] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 105684
[2023-04-21 17:48:12,710] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 105685
[2023-04-21 17:48:12,845] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 105686
[2023-04-21 17:48:12,847] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'train.py', '--local_rank=3', '--output_dir', '/root/nas-private/output', '--init_ckpt', '/root/nas-private/llama-7B-init-ckpt', '--data_path', './data/alpaca_data_sample_oneline_format.json', '--max_seq_len', '1024', '--train_steps', '1000', '--eval_steps', '10', '--save_steps', '200', '--log_steps', '1', '--pipe_parallel_size', '4', '--model_parallel_size', '1', '--use_flash_attn', 'true', '--deepspeed_config', './configs/ds_config.json']
exits with return code = 1
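
A note on the log above: the first failure is the shell reporting that `=/usr/local/cuda-11.6/bin/nvcc` was not found, and the path begins with a stray `=`. The later `fused_adam.so` ImportErrors on the other ranks look like a consequence of that failed build. One possible cause, which is an assumption and is not confirmed anywhere in this thread, is a CUDA location variable such as `CUDA_HOME` having been exported with an extra `=`. The sketch below uses only the standard library; `diagnose_cuda_home` is a hypothetical helper name, and the choice of `CUDA_HOME` and `CUDA_PATH` as the variables to inspect is likewise an assumption about this setup.

```python
import os
from typing import List, Mapping, Optional


def diagnose_cuda_home(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return warnings about suspicious CUDA_HOME / CUDA_PATH values.

    Hypothetical helper: it only inspects environment variables and the
    file system, and does not import torch or deepspeed.
    """
    env = os.environ if env is None else env
    problems: List[str] = []
    for var in ("CUDA_HOME", "CUDA_PATH"):
        value = env.get(var)
        if value is None:
            continue  # variable not set, nothing to check
        # A leading '=' or surrounding whitespace usually means a typo in
        # the export line, e.g. `export CUDA_HOME==/usr/local/cuda-11.6`.
        if value.startswith("=") or value != value.strip():
            problems.append(f"{var}={value!r} has a stray leading '=' or whitespace")
            continue
        nvcc = os.path.join(value, "bin", "nvcc")
        if not os.path.isfile(nvcc):
            problems.append(f"{var}={value!r} but {nvcc} does not exist")
    return problems


if __name__ == "__main__":
    for line in diagnose_cuda_home() or ["no obvious CUDA_HOME/CUDA_PATH problem found"]:
        print(line)
```

Running this in the same shell that launches train.py would show whether the environment, rather than the extension cache, is what feeds the bad nvcc path into the ninja build.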

Ulov888 commented 1 year ago

This was a PyTorch cache problem; reinstalling fixes it.
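
For the original hang at "Using /root/.cache/torch_extensions as PyTorch extensions root...", a narrower check than wiping all of /root/.cache is to look for leftover lock files in the extensions directory. The sketch below assumes that PyTorch's JIT extension loader keeps a file named `lock` inside each extension build directory while a build is in progress, that `TORCH_EXTENSIONS_DIR` overrides the root, and that the default root is `~/.cache/torch_extensions` as seen in the log. `extensions_root` and `find_lock_files` are hypothetical helper names, and nothing in this thread confirms that a stale lock was the cause here.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def extensions_root(env: Optional[Mapping[str, str]] = None) -> Path:
    """Guess the PyTorch extensions root (assumed layout, see lead-in)."""
    env = os.environ if env is None else env
    custom = env.get("TORCH_EXTENSIONS_DIR")
    if custom:
        return Path(custom)
    return Path.home() / ".cache" / "torch_extensions"


def find_lock_files(root: Path) -> List[Path]:
    """List files named 'lock' anywhere below root, sorted by path."""
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("lock") if p.is_file())


if __name__ == "__main__":
    root = extensions_root()
    locks = find_lock_files(root)
    print(f"extensions root: {root}")
    for lock in locks:
        # Only report; deleting is left to the user, and is only safe
        # when no training or build process is still running.
        print(f"possible stale lock: {lock}")
    if not locks:
        print("no lock files found")
```

The script only reports what it finds. Removing a lock file is something to do by hand, and only once no other process is still compiling the extension.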