baichuan-inc / Baichuan-7B

A large-scale 7B pretraining language model developed by BaiChuan-Inc.
https://huggingface.co/baichuan-inc/baichuan-7B
Apache License 2.0

[Question] #116

Closed wqmoran closed 1 year ago

wqmoran commented 1 year ago

Required prerequisites

Questions

`[2023-07-26 10:50:43,488] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. [2023-07-26 10:50:43,527] [INFO] [runner.py:541:main] cmd = /root/anaconda3/envs/baichuan2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --deepspeed --deepspeed_config config/deepspeed.json [2023-07-26 10:50:45,785] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]} [2023-07-26 10:50:45,785] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0 [2023-07-26 10:50:45,785] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]}) [2023-07-26 10:50:45,785] [INFO] [launch.py:247:main] dist_world_size=4 [2023-07-26 10:50:45,785] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3 [2023-07-26 10:50:49,732] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl [2023-07-26 10:51:51,332] [INFO] [partition_parameters.py:454:exit] finished initializing model with 7.00B parameters [2023-07-26 10:51:51,332] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.2, git-hash=unknown, git-branch=unknown [2023-07-26 10:51:52,467] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...

Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/torch_extensions/py310_cu117/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) [1/2] /root/anaconda3/envs/baichuan2/bin/nvcc -ccbin /root/anaconda3/envs/baichuan2/bin/x86_64-conda-linux-gnu-cc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/include -isystem /root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/include/TH -isystem /root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/include/THC -isystem /root/anaconda3/envs/baichuan2/include -isystem /root/anaconda3/envs/baichuan2/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -DCUDA_NO_HALF_OPERATORS -DCUDA_NO_HALF_CONVERSIONS -DCUDA_NO_BFLOAT16_CONVERSIONS -DCUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -std=c++17 -c /root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o FAILED: multi_tensor_adam.cuda.o /root/anaconda3/envs/baichuan2/bin/nvcc -ccbin 
/root/anaconda3/envs/baichuan2/bin/x86_64-conda-linux-gnu-cc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/include -isystem /root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/include/TH -isystem /root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/include/THC -isystem /root/anaconda3/envs/baichuan2/include -isystem /root/anaconda3/envs/baichuan2/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -DCUDA_NO_HALF_OPERATORS -DCUDA_NO_HALF_CONVERSIONS -DCUDA_NO_BFLOAT16_CONVERSIONS -DCUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 -std=c++17 -c /root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o

: fatal error: cuda_runtime.h: No such file or directory compilation terminated. ninja: build stopped: subcommand failed. Traceback (most recent call last): File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build subprocess.run( File "/root/anaconda3/envs/baichuan2/lib/python3.10/subprocess.py", line 526, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1. The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/mordisk/models/baichuan/baichuan-7B/train.py", line 138, in model_engine = prepare_model() File "/mordisk/models/baichuan/baichuan-7B/train.py", line 117, in prepare_model model_engine, _, _, _ = deepspeed.initialize(args=args, File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/__init__.py", line 165, in initialize engine = DeepSpeedEngine(args=args, File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 308, in __init__ self._configure_optimizer(optimizer, model_parameters) File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer basic_optimizer = self._configure_basic_optimizer(model_parameters) File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1224, in _configure_basic_optimizer optimizer = FusedAdam( File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__ fused_adam_cuda = FusedAdamBuilder().load() File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load return self.jit_load(verbose) File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load op_module 
= load(name=self.name, File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load return _jit_compile( File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile _write_ninja_file_and_build_library( File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library Loading extension module fused_adam...Loading extension module fused_adam... _run_ninja_build( File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build Traceback (most recent call last): Traceback (most recent call last): File "/mordisk/models/baichuan/baichuan-7B/train.py", line 138, in File "/mordisk/models/baichuan/baichuan-7B/train.py", line 138, in model_engine = prepare_model() model_engine = prepare_model() File "/mordisk/models/baichuan/baichuan-7B/train.py", line 117, in prepare_model File "/mordisk/models/baichuan/baichuan-7B/train.py", line 117, in prepare_model raise RuntimeError(message) from e RuntimeError: Error building extension 'fused_adam' model_engine, _, _, _ = deepspeed.initialize(args=args,model_engine, _, _, _ = deepspeed.initialize(args=args, File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/__init__.py", line 165, in initialize File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/__init__.py", line 165, in initialize engine = DeepSpeedEngine(args=args,engine = DeepSpeedEngine(args=args, File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 308, in __init__ File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 308, in __init__ self._configure_optimizer(optimizer, model_parameters)self._configure_optimizer(optimizer, model_parameters) File 
"/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer basic_optimizer = self._configure_basic_optimizer(model_parameters) File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1224, in _configure_basic_optimizer basic_optimizer = self._configure_basic_optimizer(model_parameters) File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1224, in _configure_basic_optimizer optimizer = FusedAdam( File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__ optimizer = FusedAdam( File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__ fused_adam_cuda = FusedAdamBuilder().load() File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load fused_adam_cuda = FusedAdamBuilder().load() File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load return self.jit_load(verbose) File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load return self.jit_load(verbose) File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load op_module = load(name=self.name, File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load op_module = load(name=self.name, File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load return _jit_compile( File 
"/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile return _jit_compile( File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile return _import_module_from_library(name, build_directory, is_python_module) File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library return _import_module_from_library(name, build_directory, is_python_module) File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library module = importlib.util.module_from_spec(spec) File "", line 571, in module_from_spec File "", line 1176, in create_module module = importlib.util.module_from_spec(spec) File "", line 571, in module_from_spec File "", line 241, in _call_with_frames_removed ImportError: File "", line 1176, in create_module /root/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory File "", line 241, in _call_with_frames_removed ImportError: /root/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory Loading extension module fused_adam... 
Traceback (most recent call last): File "/mordisk/models/baichuan/baichuan-7B/train.py", line 138, in model_engine = prepare_model() File "/mordisk/models/baichuan/baichuan-7B/train.py", line 117, in prepare_model model_engine, _, _, _ = deepspeed.initialize(args=args, File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/__init__.py", line 165, in initialize engine = DeepSpeedEngine(args=args, File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 308, in __init__ self._configure_optimizer(optimizer, model_parameters) File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer basic_optimizer = self._configure_basic_optimizer(model_parameters) File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1224, in _configure_basic_optimizer optimizer = FusedAdam( File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__ fused_adam_cuda = FusedAdamBuilder().load() File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load return self.jit_load(verbose) File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load op_module = load(name=self.name, File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load return _jit_compile( File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile return _import_module_from_library(name, build_directory, is_python_module) File "/root/anaconda3/envs/baichuan2/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library module = importlib.util.module_from_spec(spec) File "", line 571, in module_from_spec File "", 
line 1176, in create_module File "", line 241, in _call_with_frames_removed ImportError: /root/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory [2023-07-26 10:51:54,850] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9089 [2023-07-26 10:51:54,850] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9090 [2023-07-26 10:51:54,865] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9091 [2023-07-26 10:51:54,878] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 9092 [2023-07-26 10:51:54,890] [ERROR] [launch.py:434:sigkill_handler] ['/root/anaconda3/envs/baichuan2/bin/python', '-u', 'train.py', '--local_rank=3', '--deepspeed', '--deepspeed_config', 'config/deepspeed.json'] exits with return code = 1`

I hit this error when running scripts/train.sh. Apart from torch being 2.0.1, everything else matches the requirements in the documentation. CUDA is 11.7, the OS is CentOS 7.9, and I am using four T4 GPUs. I would appreciate it if someone could take a look.

### Checklist

- [X] I have provided all relevant and necessary information above.
- [X] I have chosen a suitable title for this issue.
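
The first failure in the log is `fatal error: cuda_runtime.h: No such file or directory` while nvcc compiles `multi_tensor_adam.cu`. The later `fused_adam.so: cannot open shared object file` errors appear to follow from that failed build. A quick check like the one below may help confirm whether the CUDA headers are present anywhere the build could plausibly look. This is only a sketch: the helper name `find_cuda_header` and the list of candidate directories (including the `targets/x86_64-linux/include` layout) are assumptions for illustration, not something taken from DeepSpeed or this repository.

```python
import os
from pathlib import Path


def find_cuda_header(candidates, header="cuda_runtime.h"):
    """Return the first include directory containing `header`, or None.

    `candidates` is an iterable of possible CUDA root directories;
    entries that are None or empty are skipped.
    """
    for root in candidates:
        if not root:
            continue
        # Assumption: headers live in <root>/include, or in
        # <root>/targets/x86_64-linux/include as some CUDA packages lay them out.
        for include_dir in (
            Path(root) / "include",
            Path(root) / "targets" / "x86_64-linux" / "include",
        ):
            if (include_dir / header).is_file():
                return include_dir
    return None


if __name__ == "__main__":
    # Hypothetical candidate roots; adjust them to the machine being checked.
    roots = [
        os.environ.get("CUDA_HOME"),
        os.environ.get("CUDA_PATH"),
        os.environ.get("CONDA_PREFIX"),
        "/usr/local/cuda",
    ]
    found = find_cuda_header(roots)
    if found:
        print(f"cuda_runtime.h found in: {found}")
    else:
        print("cuda_runtime.h not found in any candidate directory")
```

If the header is only found under a system toolkit such as `/usr/local/cuda` and not under the conda environment whose `nvcc` is being invoked, pointing `CUDA_HOME` at the full toolkit before launching may be worth trying, though whether that resolves this particular setup is not confirmed here.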