THUDM / ChatGLM-6B

ChatGLM-6B: An Open Bilingual Dialogue Language Model
Apache License 2.0

[Help] Multi-GPU training keeps failing with cache/torch_extensions/py38_cu113/utils/utils.so: cannot open shared object file: No such file or directory #761

Open shishijier opened 1 year ago

shishijier commented 1 year ago

Is there an existing issue for this?

Current Behavior

Loading extension module utils...
Traceback (most recent call last):
  File "main.py", line 431, in <module>
    main()
  File "main.py", line 370, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/disk1/shisj/project/ChatGLM-6B/ptuning/trainer.py", line 1635, in train
    return inner_training_loop(
  File "/disk1/shisj/project/ChatGLM-6B/ptuning/trainer.py", line 1704, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1167, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1398, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 154, in __init__
    util_ops = UtilsBuilder().load()
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1202, in load
    return _jit_compile(
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1450, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/disk1/shisj/anaconda3/envs/glm/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1844, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 556, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1101, in create_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ImportError: /disk1/shisj/cache/torch_extensions/py38_cu113/utils/utils.so: cannot open shared object file: No such file or directory

During multi-GPU training, it reports that the utils.so file cannot be found.

Expected Behavior

No response

Steps To Reproduce

Environment

- OS: CentOS 7.9.2009
- Python: 3.8
- Transformers: 4.27.1
- PyTorch: 1.12.1
- CUDA: 11.3
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): True

Anything else?

No response

Chiang97912 commented 1 year ago

Make sure you have installed ninja. You can install it with `conda install ninja`.
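A quick way to check this from the same Python environment that runs the training script is sketched below. It uses only the standard library, and the helper name `find_ninja` is made up for illustration. It looks for a `ninja` executable on `PATH` and asks it for its version, so it only tells you whether ninja is callable, not whether the extension build will succeed.

```python
import shutil
import subprocess
from typing import Optional


def find_ninja() -> Optional[str]:
    """Return the version reported by the ninja on PATH, or None if it is missing or broken."""
    exe = shutil.which("ninja")
    if exe is None:
        # No ninja executable is visible to this environment.
        return None
    try:
        result = subprocess.run(
            [exe, "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        # The executable exists but could not be run successfully.
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    version = find_ninja()
    print("ninja is not usable" if version is None else f"ninja {version}")
```

Run it with the interpreter of the conda environment used for training (here `glm`), since a ninja installed elsewhere may not be on that environment's `PATH`.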

roki1031 commented 1 year ago

I ran into the same problem. Have you solved it yet?

roki1031 commented 1 year ago

> Make sure you have installed ninja. You can install it with `conda install ninja`.

I ran `ninja --version` and the result is `1.11.1.git.kitware.jobserver-1`.

cycoe commented 1 year ago

Go to the `/disk1/shisj/cache/torch_extensions/py38_cu113/utils/` directory, then compile utils.so manually with ninja.
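The manual build suggested above can be sketched in Python as follows. This assumes the directory already contains a `build.ninja` file, which is what PyTorch's JIT extension loader normally writes there before building; the function name `build_with_ninja` is hypothetical. If there is no `build.ninja`, there is nothing to compile by hand and the function says so instead of calling ninja.

```python
import subprocess
from pathlib import Path


def build_with_ninja(build_dir: str, library: str = "utils.so") -> bool:
    """Run ninja inside build_dir and report whether the library file exists afterwards."""
    build_path = Path(build_dir)
    if not (build_path / "build.ninja").is_file():
        # Without a build.ninja there is no build description for ninja to follow.
        raise FileNotFoundError(f"no build.ninja found in {build_path}")
    # -v prints the full compiler commands, which helps when the build fails.
    subprocess.run(["ninja", "-v"], cwd=build_path, check=True)
    return (build_path / library).is_file()
```

For the path in this issue the call would be `build_with_ninja("/disk1/shisj/cache/torch_extensions/py38_cu113/utils/")`. A `subprocess.CalledProcessError` means ninja ran but the compilation itself failed.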

eziohzy commented 1 year ago

> Go to the `/disk1/shisj/cache/torch_extensions/py38_cu113/utils/` directory, then compile utils.so manually with ninja.

There is nothing in this folder. PS: I reinstalled ninja and it worked! I still don't know why.
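For anyone landing here with the same error, a small sketch for inspecting the extension build directory before rerunning may help. The thread does not establish the root cause, so this is diagnostic only. The helper names are hypothetical, and the `lock` file name is an assumption based on how PyTorch's JIT loader appears to coordinate concurrent builds. Deleting the directory is a commonly used way to make the next run attempt the build from scratch, not a confirmed fix for this issue.

```python
import shutil
from pathlib import Path
from typing import Dict


def inspect_build_dir(build_dir: str, library: str = "utils.so") -> Dict[str, bool]:
    """Summarize what is present in a torch_extensions build directory."""
    path = Path(build_dir)
    return {
        "exists": path.is_dir(),
        "has_library": (path / library).is_file(),
        "has_build_ninja": (path / "build.ninja").is_file(),
        # Assumed name of the lock file used while one process builds.
        "has_lock": (path / "lock").exists(),
    }


def reset_build_dir(build_dir: str) -> None:
    """Remove the build directory so the next run starts with a clean slate."""
    shutil.rmtree(build_dir, ignore_errors=True)
```

An empty directory, as reported above, would show `exists` as true and everything else as false, which suggests the build step never produced any output there.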