Closed linhaojia13 closed 1 year ago
Hi, thank you for your interest in our work and code. Several of my co-authors also use this codebase and follow README.md to build the environment, and we have never seen a similar error. Have you made sure that your local (system) CUDA version is 11.3, and not only the cudatoolkit in Conda? The build process of pointops might use your system default CUDA. You can run nvcc -V to check the version.
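For anyone who wants to script this check, here is a minimal sketch that reads the release number from nvcc -V so it can be compared against the cudatoolkit version by hand. The helper names parse_nvcc_release and system_nvcc_release are hypothetical and not part of this repository, and the sketch assumes nvcc prints a line containing "release X.Y", as recent CUDA toolkits do.

```python
import re
import subprocess


def parse_nvcc_release(text):
    """Return the 'X.Y' that follows 'release' in `nvcc -V` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", text)
    return match.group(1) if match else None


def system_nvcc_release():
    """Run the nvcc found on PATH; return its release string, or None if unavailable."""
    try:
        result = subprocess.run(
            ["nvcc", "-V"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        # nvcc is missing from PATH or exited with an error.
        return None
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    print("system nvcc release:", system_nvcc_release())
```

If this prints something other than 11.3 (or None), the nvcc on PATH is probably not the toolkit you intended to build pointops with.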
Thank you for your reply. I have made sure the local CUDA used for building pointops is 11.3.
That's a bit weird. Could you show me the output of conda env export?
Hi, I googled the error, and it seems to be caused by a mismatch between the NVIDIA driver, CUDA, and cuDNN. Also, we can see that the error did not come from pointops; it was reported by PyTorch.
Thank you for your kind help!
I ran nvidia-smi, and it shows that the NVIDIA driver version is 450.80.02. According to the NVIDIA docs, this driver version supports CUDA 11.3.
The output of conda env export is as follows:
name: pcr
channels:
- pyg
- anaconda
- pytorch
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=2_kmp_llvm
- addict=2.4.0=pyhd8ed1ab_2
- blas=1.0=mkl
- brotlipy=0.7.0=py38h0a891b7_1005
- bzip2=1.0.8=h7f98852_4
- ca-certificates=2022.12.7=ha878542_0
- certifi=2022.12.7=pyhd8ed1ab_0
- cffi=1.15.1=py38h4a40e3a_3
- charset-normalizer=2.1.1=pyhd8ed1ab_0
- colorama=0.4.6=pyhd8ed1ab_0
- cryptography=39.0.0=py38h1724139_0
- cudatoolkit=11.3.1=h9edb442_10
- einops=0.6.0=pyhd8ed1ab_0
- ffmpeg=4.3=hf484d3e_0
- filelock=3.9.0=pyhd8ed1ab_0
- freetype=2.10.4=h0708190_1
- gmp=6.2.1=h58526e2_0
- gnutls=3.6.13=h85f3911_1
- h5py=3.7.0=py38h737f45e_0
- hdf5=1.10.6=h3ffc7dd_1
- huggingface_hub=0.11.1=pyhd8ed1ab_0
- idna=3.4=pyhd8ed1ab_0
- importlib-metadata=6.0.0=pyha770c72_0
- importlib_metadata=6.0.0=hd8ed1ab_0
- intel-openmp=2021.4.0=h06a4308_3561
- jbig=2.1=h7f98852_2003
- jpeg=9e=h166bdaf_1
- lame=3.100=h7f98852_1001
- lcms2=2.12=hddcbb42_0
- ld_impl_linux-64=2.38=h1181459_1
- lerc=2.2.1=h9c3ff4c_0
- libblas=3.9.0=12_linux64_mkl
- libcblas=3.9.0=12_linux64_mkl
- libdeflate=1.7=h7f98852_5
- libffi=3.4.2=h6a678d5_6
- libgcc-ng=12.2.0=h65d4601_19
- libgfortran-ng=11.2.0=h00389a5_1
- libgfortran5=11.2.0=h1234567_1
- libiconv=1.17=h166bdaf_0
- liblapack=3.9.0=12_linux64_mkl
- libpng=1.6.37=h21135ba_2
- libprotobuf=3.15.8=h780b84a_1
- libstdcxx-ng=11.2.0=h1234567_1
- libtiff=4.3.0=hf544144_1
- libuv=1.43.0=h7f98852_0
- libwebp-base=1.2.2=h7f98852_1
- llvm-openmp=14.0.6=h9e868ea_0
- lz4-c=1.9.3=h9c3ff4c_1
- mkl=2021.4.0=h06a4308_640
- mkl-service=2.4.0=py38h95df7f1_0
- mkl_fft=1.3.1=py38h8666266_1
- mkl_random=1.2.2=py38h1abd341_0
- ncurses=6.3=h5eee18b_3
- nettle=3.6=he412f7d_0
- ninja-base=1.10.2=hd09550d_5
- numpy=1.23.5=py38h14f4228_0
- numpy-base=1.23.5=py38h31eccc5_0
- olefile=0.46=pyh9f0ad1d_1
- openh264=2.1.1=h780b84a_0
- openjpeg=2.4.0=hb52868f_1
- openssl=1.1.1s=h0b41bf4_1
- packaging=23.0=pyhd8ed1ab_0
- pillow=8.3.2=py38h8e6f84c_0
- pip=22.3.1=py38h06a4308_0
- plyfile=0.7.4=pyhd8ed1ab_0
- protobuf=3.15.8=py38h709712a_0
- pycparser=2.21=pyhd8ed1ab_0
- pyopenssl=23.0.0=pyhd8ed1ab_0
- pysocks=1.7.1=pyha2e5f31_6
- python=3.8.15=h7a1cb2a_2
- python_abi=3.8=2_cp38
- pytorch=1.10.1=py3.8_cuda11.3_cudnn8.2.0_0
- pytorch-cluster=1.6.0=py38_torch_1.10.0_cu113
- pytorch-mutex=1.0=cuda
- pytorch-scatter=2.0.9=py38_torch_1.10.0_cu113
- pytorch-sparse=0.6.13=py38_torch_1.10.0_cu113
- pyyaml=6.0=py38h7f8727e_1
- readline=8.2=h5eee18b_0
- requests=2.28.1=pyhd8ed1ab_1
- scipy=1.8.1=py38h1ee437e_0
- setuptools=65.6.3=py38h06a4308_0
- sharedarray=3.2.2=py38h26c90d9_1
- six=1.16.0=pyh6c4a22f_0
- sqlite=3.40.1=h5082296_0
- tensorboardx=2.5.1=pyhd8ed1ab_0
- termcolor=2.2.0=pyhd8ed1ab_0
- timm=0.6.12=pyhd8ed1ab_0
- tk=8.6.12=h1ccaba5_0
- torchaudio=0.10.1=py38_cu113
- torchvision=0.11.2=py38_cu113
- tqdm=4.64.1=pyhd8ed1ab_0
- typing-extensions=4.4.0=hd8ed1ab_0
- typing_extensions=4.4.0=pyha770c72_0
- urllib3=1.26.14=pyhd8ed1ab_0
- wheel=0.37.1=pyhd3eb1b0_0
- xz=5.2.8=h5eee18b_0
- yaml=0.2.5=h7b6447c_0
- yapf=0.32.0=pyhd8ed1ab_0
- zipp=3.11.0=pyhd8ed1ab_0
- zlib=1.2.13=h5eee18b_0
- zstd=1.5.0=ha95c52a_0
- pip:
- asttokens==2.2.1
- attrs==22.2.0
- backcall==0.2.0
- ccimport==0.4.2
- click==8.1.3
- comm==0.1.2
- configargparse==1.5.3
- contourpy==1.0.7
- cumm-cu113==0.3.7
- cycler==0.11.0
- dash==2.7.1
- dash-core-components==2.0.0
- dash-html-components==2.0.0
- dash-table==5.0.0
- debugpy==1.6.5
- decorator==5.1.1
- entrypoints==0.4
- executing==1.2.0
- fastjsonschema==2.16.2
- fire==0.5.0
- flask==2.2.2
- fonttools==4.38.0
- importlib-resources==5.10.2
- ipykernel==6.20.1
- ipython==8.8.0
- ipywidgets==8.0.4
- itsdangerous==2.1.2
- jedi==0.18.2
- jinja2==3.1.2
- joblib==1.2.0
- jsonschema==4.17.3
- jupyter-client==7.4.9
- jupyter-core==5.1.3
- jupyterlab-widgets==3.0.5
- kiwisolver==1.4.4
- lark==1.1.5
- markupsafe==2.1.1
- matplotlib==3.6.3
- matplotlib-inline==0.1.6
- nbformat==5.5.0
- nest-asyncio==1.5.6
- ninja==1.11.1
- open3d==0.16.0
- pandas==1.5.2
- parso==0.8.3
- pccm==0.4.4
- pexpect==4.8.0
- pickleshare==0.7.5
- pkgutil-resolve-name==1.3.10
- platformdirs==2.6.2
- plotly==5.12.0
- pointops==1.0
- portalocker==2.6.0
- prompt-toolkit==3.0.36
- psutil==5.9.4
- ptyprocess==0.7.0
- pure-eval==0.2.2
- pybind11==2.10.3
- pygments==2.14.0
- pyparsing==3.0.9
- pyquaternion==0.9.9
- pyrsistent==0.19.3
- python-dateutil==2.8.2
- pytz==2022.7.1
- pyzmq==25.0.0
- scikit-learn==1.2.0
- spconv-cu113==2.2.6
- stack-data==0.6.2
- tenacity==8.1.0
- threadpoolctl==3.1.0
- torch-geometric==2.2.0
- tornado==6.2
- traitlets==5.8.1
- wcwidth==0.2.6
- werkzeug==2.2.2
- widgetsnbextension==4.0.5
prefix: /opt/conda/envs/pcr
I just checked the package versions and found nothing irregular. Meanwhile, I noticed that the error occurred at pcr/utils/comm.py L86 (torch.cuda.current_device()). I think another way to check the environment is to run the following Python code:
import torch
print(torch.cuda.is_available())
print(torch.cuda.current_device())
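A slightly more defensive variant of the same check might look like the sketch below. It only uses standard PyTorch attributes (torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.current_device(), torch.cuda.get_device_name()); the function name cuda_report is hypothetical. It catches the RuntimeError instead of crashing, so the message can be pasted into the issue.

```python
import importlib.util


def cuda_report():
    """Collect basic facts about the PyTorch/CUDA setup without raising."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    # CUDA version PyTorch was built against; None for CPU-only builds.
    report["built_for_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        try:
            index = torch.cuda.current_device()
            report["current_device"] = index
            report["device_name"] = torch.cuda.get_device_name(index)
        except RuntimeError as exc:
            # Driver/runtime problems typically surface here.
            report["error"] = str(exc)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If cuda_available is False, or an error entry shows up, the problem is likely in the driver or CUDA runtime rather than in pointops.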
Hey, I switched to a server with NVIDIA driver version 470.57.02 and it stopped giving errors.
It seems the issue was the NVIDIA driver version: 450.80.02 supports almost all of PyTorch 1.10 but appears to be incompatible with some dependencies of torch.distributed.
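For others comparing their own driver against the two versions mentioned here, a rough sketch follows. The helper names are hypothetical, and 470.57.02 is only the version that happened to work in this thread, not a documented minimum. The nvidia-smi query flags used (--query-gpu=driver_version --format=csv,noheader) are standard nvidia-smi options.

```python
import subprocess


def parse_driver_version(version):
    """Turn a driver string such as '450.80.02' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_at_least(installed, required):
    """True if the installed driver version is >= the required one."""
    return parse_driver_version(installed) >= parse_driver_version(required)


def installed_driver_version():
    """Ask nvidia-smi for the driver version; return None if it cannot be queried."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    lines = result.stdout.strip().splitlines()
    # One line per GPU; they all share the same driver, so take the first.
    return lines[0].strip() if lines else None


if __name__ == "__main__":
    current = installed_driver_version()
    if current is None:
        print("Could not query nvidia-smi.")
    else:
        # Assumption: 470.57.02 is the version reported to work in this thread.
        print(current, "ok" if driver_at_least(current, "470.57.02") else "older than 470.57.02")
```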
Thank you for your constant and warm help! @Gofinge
Great! I will close the issue, and if you have any questions, please feel free to open an issue.
Hi, I ran into the same issue. Is there any other solution that does not require changing the NVIDIA driver version?
Hi, thank you for open-sourcing the code of PTv2 and PCR. I expect this repository will be a great convenience for research in the community.
I followed README.md to set up the environment and then ran the training script, but it returned the following error:
It seems that the problem lies in an incorrect version of cudatoolkit, but in fact I have ensured that the cudatoolkit version used for compiling pointops and the one installed for PyTorch are the same (both 11.3). Do you have any idea about this RuntimeError?