CoinCheung / gdGPT

Train LLMs (bloom, llama, baichuan2-7b, chatglm3-6b) with DeepSpeed pipeline mode. Faster than ZeRO/ZeRO++/FSDP.
Apache License 2.0
90 stars 8 forks

ninja -v command fails, leaving transformer_inference.so missing #12

Open Debouter opened 1 year ago

Debouter commented 1 year ago

Hi~ When running demo.py I get the following error:

Traceback (most recent call last):
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
    ......
ImportError: /mnt/petrelfs/klk/.cache/torch_extensions/py310_cu118/transformer_inference/transformer_inference.so: cannot open shared object file: No such file or directory

My initial guess is that the ninja -v command is failing, so the shared object file transformer_inference.so never gets built.
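A minimal sketch (assuming DeepSpeed imports cleanly in the same conda env that demo.py uses) that triggers the same JIT build directly, so the full error from the failed ninja build is easier to read than when it is interleaved across ranks:

# Build the transformer_inference op directly, outside demo.py. When the build
# fails, the RuntimeError raised here carries the underlying compiler/ninja
# output, which is what "returned non-zero exit status 1" hides.
from deepspeed.ops.op_builder import InferenceBuilder

InferenceBuilder().load(verbose=True)  # JIT-compiles transformer_inference.so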

I have already tried the fixes suggested online for Command '['ninja', '-v']' returned non-zero exit status 1, such as installing or disabling the ninja package and downgrading PyTorch, but none of them solved the problem.

My environment is as follows:

Have you ever run into this problem? If not, could you share your transformer_inference.so file? It should be located at roughly /.cache/torch_extensions/pyXX_cuXX/transformer_inference.

Thanks!

CoinCheung commented 1 year ago

Hi,

Would you post your full error message? I do not have this problem.

Debouter commented 1 year ago

Here is the whole stack trace. By the way, could you please tell me which versions of GCC and Ninja you use?

Using /mnt/petrelfs/klk/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /mnt/petrelfs/klk/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/petrelfs/klk/.cache/torch_extensions/py310_cu118/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Traceback (most recent call last):
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/petrelfs/klk/gdGPT/demo.py", line 64, in <module>
    res = infer_with_deepspeed(model_name, prompt)
  File "/mnt/petrelfs/klk/gdGPT/demo.py", line 40, in infer_with_deepspeed
    model.model = deepspeed.init_inference(model.model, config=infer_config)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/__init__.py", line 342, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 192, in __init__
    self._apply_injection_policy(config)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 426, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 523, in replace_transformer_layer
    replaced_module = replace_module(model=model,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 766, in replace_module
    replaced_module, _ = _replace_module(model, policy, state_dict=sd)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 847, in _replace_module
    _, layer_id = _replace_module(child,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 847, in _replace_module
    _, layer_id = _replace_module(child,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 823, in _replace_module
    replaced_module = policies[child.__class__][0](child,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 500, in replace_fn
    new_module = replace_with_policy(child,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 348, in replace_with_policy
    _container.create_module()
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/containers/bloom.py", line 30, in create_module
    self.module = DeepSpeedBloomInference(_config, mp_group=self.mp_group)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_bloom.py", line 20, in __init__
    super().__init__(config, mp_group, quantize_scales, quantize_groups, merge_count, mlp_extra_grouping)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 58, in __init__
    inference_module = builder.load()
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 454, in load
    return self.jit_load(verbose)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load
    op_module = load(name=self.name,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'transformer_inference'

Loading extension module transformer_inference...
Traceback (most recent call last):
  File "/mnt/petrelfs/klk/gdGPT/demo.py", line 64, in <module>
    res = infer_with_deepspeed(model_name, prompt)
  File "/mnt/petrelfs/klk/gdGPT/demo.py", line 40, in infer_with_deepspeed
    model.model = deepspeed.init_inference(model.model, config=infer_config)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/__init__.py", line 342, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 192, in __init__
    self._apply_injection_policy(config)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 426, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 523, in replace_transformer_layer
    replaced_module = replace_module(model=model,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 766, in replace_module
    replaced_module, _ = _replace_module(model, policy, state_dict=sd)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 847, in _replace_module
    _, layer_id = _replace_module(child,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 847, in _replace_module
    _, layer_id = _replace_module(child,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 823, in _replace_module
    replaced_module = policies[child.__class__][0](child,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 500, in replace_fn
    new_module = replace_with_policy(child,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 348, in replace_with_policy
    _container.create_module()
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/containers/bloom.py", line 30, in create_module
    self.module = DeepSpeedBloomInference(_config, mp_group=self.mp_group)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_bloom.py", line 20, in __init__
    super().__init__(config, mp_group, quantize_scales, quantize_groups, merge_count, mlp_extra_grouping)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 58, in __init__
    inference_module = builder.load()
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 454, in load
    return self.jit_load(verbose)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load
    op_module = load(name=self.name,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1176, in create_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /mnt/petrelfs/klk/.cache/torch_extensions/py310_cu118/transformer_inference/transformer_inference.so: cannot open shared object file: No such file or directory

CoinCheung commented 1 year ago

Hi,

the output of running ninja --version on my machine is:

1.11.1.git.kitware.jobserver-1

and the output of running gcc -v on my machine is:

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.5.0-3ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) 

Would you rm -rf /mnt/petrelfs/klk/.cache/torch_extensions/py310_cu118 and try again?
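
If it helps to compare against the versions above, here is a small sketch that prints the toolchain the JIT build will pick up. is_ninja_available and CUDA_HOME are existing helpers in torch.utils.cpp_extension; the subprocess calls simply assume gcc and ninja are on PATH.

# Print the ninja/gcc/CUDA toolchain that torch's JIT extension build will see,
# so it can be compared with the versions quoted above.
import subprocess
from torch.utils.cpp_extension import is_ninja_available, CUDA_HOME

print("ninja available:", is_ninja_available())
print("CUDA_HOME:", CUDA_HOME)
print(subprocess.run(["gcc", "--version"], capture_output=True, text=True).stdout.splitlines()[0])
print(subprocess.run(["ninja", "--version"], capture_output=True, text=True).stdout.strip())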

Debouter commented 1 year ago

Well, I have fixed it by adjusting the gcc version to match yours, removing the directory you mentioned above, and setting export TORCH_EXTENSIONS_DIR=/tmp as suggested in https://github.com/microsoft/DeepSpeed/issues/3356.
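
For anyone hitting the same thing, a rough sketch of that workaround from Python (the gcc downgrade is a system-level step not shown here; the paths are the ones discussed above):

# Clear the stale build cache mentioned earlier in this thread and redirect the
# torch extension directory. The env var only needs to be set before
# deepspeed.init_inference() triggers the transformer_inference JIT compile.
import os
import shutil

shutil.rmtree(os.path.expanduser("~/.cache/torch_extensions/py310_cu118"),
              ignore_errors=True)          # remove the old, broken build
os.environ["TORCH_EXTENSIONS_DIR"] = "/tmp"  # per the linked DeepSpeed issue

import deepspeed  # importing afterwards is fine; the env var is read at build time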

Though similar problems occasionally occur during other installations, it works fine in this repo. Anyway, thanks a lot!