OpenBMB / BMTrain

Efficient Training (including pre-training and fine-tuning) for Big Models
Apache License 2.0

Error when pip install bmtrain #137

Closed HBX-hbx closed 1 year ago

HBX-hbx commented 1 year ago

PyTorch 1.13.1, CUDA version 11.2

Building wheel for bmtrain (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /data/private/hebingxiang/miniconda3/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-hcfrmsk4/bmtrain_7495d4a1219f45dc8e9bca0dade5da43/setup.py'"'"'; __file__='"'"'/tmp/pip-install-hcfrmsk4/bmtrain_7495d4a1219f45dc8e9bca0dade5da43/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-fof5ziro
cwd: /tmp/pip-install-hcfrmsk4/bmtrain_7495d4a1219f45dc8e9bca0dade5da43/
Complete output (67 lines):
running bdist_wheel
/data/private/hebingxiang/miniconda3/lib/python3.9/site-packages/torch/utils/cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
warnings.warn(msg.format('we could not find ninja.'))
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.9
creating build/lib.linux-x86_64-3.9/bmtrain
copying bmtrain/debug.py -> build/lib.linux-x86_64-3.9/bmtrain
copying bmtrain/param_init.py -> build/lib.linux-x86_64-3.9/bmtrain
copying bmtrain/checkpointing.py -> build/lib.linux-x86_64-3.9/bmtrain
copying bmtrain/global_var.py -> build/lib.linux-x86_64-3.9/bmtrain
copying bmtrain/synchronize.py -> build/lib.linux-x86_64-3.9/bmtrain
copying bmtrain/pipe_layer.py -> build/lib.linux-x86_64-3.9/bmtrain
copying bmtrain/parameter.py -> build/lib.linux-x86_64-3.9/bmtrain
copying bmtrain/init.py -> build/lib.linux-x86_64-3.9/bmtrain
copying bmtrain/utils.py -> build/lib.linux-x86_64-3.9/bmtrain
copying bmtrain/wrapper.py -> build/lib.linux-x86_64-3.9/bmtrain
copying bmtrain/block_layer.py -> build/lib.linux-x86_64-3.9/bmtrain
copying bmtrain/layer.py -> build/lib.linux-x86_64-3.9/bmtrain
copying bmtrain/init.py -> build/lib.linux-x86_64-3.9/bmtrain
copying bmtrain/store.py -> build/lib.linux-x86_64-3.9/bmtrain
creating build/lib.linux-x86_64-3.9/bmtrain/nccl
copying bmtrain/nccl/enums.py -> build/lib.linux-x86_64-3.9/bmtrain/nccl
copying bmtrain/nccl/__init__.py -> build/lib.linux-x86_64-3.9/bmtrain/nccl
creating build/lib.linux-x86_64-3.9/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/noam.py -> build/lib.linux-x86_64-3.9/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/warmup.py -> build/lib.linux-x86_64-3.9/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/__init__.py -> build/lib.linux-x86_64-3.9/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/no_decay.py -> build/lib.linux-x86_64-3.9/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/exponential.py -> build/lib.linux-x86_64-3.9/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/linear.py -> build/lib.linux-x86_64-3.9/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/cosine.py -> build/lib.linux-x86_64-3.9/bmtrain/lr_scheduler
creating build/lib.linux-x86_64-3.9/bmtrain/benchmark
copying bmtrain/benchmark/all_gather.py -> build/lib.linux-x86_64-3.9/bmtrain/benchmark
copying bmtrain/benchmark/send_recv.py -> build/lib.linux-x86_64-3.9/bmtrain/benchmark
copying bmtrain/benchmark/__init__.py -> build/lib.linux-x86_64-3.9/bmtrain/benchmark
copying bmtrain/benchmark/shape.py -> build/lib.linux-x86_64-3.9/bmtrain/benchmark
copying bmtrain/benchmark/utils.py -> build/lib.linux-x86_64-3.9/bmtrain/benchmark
copying bmtrain/benchmark/reduce_scatter.py -> build/lib.linux-x86_64-3.9/bmtrain/benchmark
creating build/lib.linux-x86_64-3.9/bmtrain/optim
copying bmtrain/optim/adam_offload.py -> build/lib.linux-x86_64-3.9/bmtrain/optim
copying bmtrain/optim/__init__.py -> build/lib.linux-x86_64-3.9/bmtrain/optim
copying bmtrain/optim/optim_manager.py -> build/lib.linux-x86_64-3.9/bmtrain/optim
copying bmtrain/optim/adam.py -> build/lib.linux-x86_64-3.9/bmtrain/optim
creating build/lib.linux-x86_64-3.9/bmtrain/distributed
copying bmtrain/distributed/ops.py -> build/lib.linux-x86_64-3.9/bmtrain/distributed
copying bmtrain/distributed/__init__.py -> build/lib.linux-x86_64-3.9/bmtrain/distributed
creating build/lib.linux-x86_64-3.9/bmtrain/loss
copying bmtrain/loss/cross_entropy.py -> build/lib.linux-x86_64-3.9/bmtrain/loss
copying bmtrain/loss/__init__.py -> build/lib.linux-x86_64-3.9/bmtrain/loss
creating build/lib.linux-x86_64-3.9/bmtrain/inspect
copying bmtrain/inspect/model.py -> build/lib.linux-x86_64-3.9/bmtrain/inspect
copying bmtrain/inspect/format.py -> build/lib.linux-x86_64-3.9/bmtrain/inspect
copying bmtrain/inspect/__init__.py -> build/lib.linux-x86_64-3.9/bmtrain/inspect
copying bmtrain/inspect/tensor.py -> build/lib.linux-x86_64-3.9/bmtrain/inspect
running build_ext
building 'bmtrain.nccl._C' extension
creating build/temp.linux-x86_64-3.9
creating build/temp.linux-x86_64-3.9/csrc
gcc -pthread -B /data/private/hebingxiang/miniconda3/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /data/private/hebingxiang/miniconda3/include -I/data/private/hebingxiang/miniconda3/include -fPIC -O2 -isystem /data/private/hebingxiang/miniconda3/include -fPIC -Icsrc/nccl/build/include -I/data/private/hebingxiang/miniconda3/lib/python3.9/site-packages/torch/include -I/data/private/hebingxiang/miniconda3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/data/private/hebingxiang/miniconda3/lib/python3.9/site-packages/torch/include/TH -I/data/private/hebingxiang/miniconda3/lib/python3.9/site-packages/torch/include/THC -I/data/private/hebingxiang/miniconda3/include -I/data/private/hebingxiang/miniconda3/include/python3.9 -c csrc/nccl.cpp -o build/temp.linux-x86_64-3.9/csrc/nccl.o -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
In file included from csrc/nccl.cpp:4:
/data/private/hebingxiang/miniconda3/lib/python3.9/site-packages/torch/include/ATen/cuda/CUDAContext.h:10:10: fatal error: cusolverDn.h: No such file or directory
10 | #include <cusolverDn.h>
| ^~~~~~
compilation terminated.
error: command '/usr/bin/gcc' failed with exit code 1

ERROR: Failed building wheel for bmtrain

MayDomine commented 1 year ago

It seems that your torch installation cannot find the CUDA header files (here, `cusolverDn.h`). Please check your environment variable settings. BMTrain 0.2.3 no longer depends on torch when compiling the .so file, so this problem won't happen anymore.
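One way to check the environment is sketched below. It assumes the build looks for the CUDA toolkit via the `CUDA_HOME` or `CUDA_PATH` environment variables and falls back to `/usr/local/cuda`. The helper name `find_cuda_header` and its parameters are hypothetical, written only for this illustration, and are not part of BMTrain or PyTorch.

```python
import os
from pathlib import Path


def find_cuda_header(header="cusolverDn.h", extra_roots=()):
    """Return the first path where `header` exists under a CUDA root, else None.

    Hypothetical helper for illustration; not a BMTrain or PyTorch API.
    """
    roots = []
    # Assumption: the toolkit root is advertised through one of these variables.
    for var in ("CUDA_HOME", "CUDA_PATH"):
        value = os.environ.get(var)
        if value:
            roots.append(Path(value))
    # Any additional toolkit roots the caller wants to probe.
    roots.extend(Path(root) for root in extra_roots)
    # Assumption: a common default install location on Linux.
    roots.append(Path("/usr/local/cuda"))

    for root in roots:
        candidate = root / "include" / header
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    found = find_cuda_header()
    if found:
        print(f"Found header at {found}")
    else:
        print("cusolverDn.h not found; check CUDA_HOME and the toolkit install")
```

If the header is not found, the likely causes are that the CUDA toolkit (not just the driver) is not installed, or that `CUDA_HOME` points at a directory without an `include/` folder.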

HBX-hbx commented 1 year ago

> It seems that your torch installation cannot find the CUDA header files. Please check your environment variable settings. BMTrain 0.2.3 no longer depends on torch when compiling the .so file, so this problem won't happen anymore.

But 0.2.3 has not been released.