cherubicXN / hawp

Holistically-Attracted Wireframe Parsing [TPAMI'23] & [CVPR'20]
MIT License

Failed to run the program #14

Closed hellodrx closed 3 years ago

hellodrx commented 3 years ago

Hi, Nan Xue,

Thanks for your excellent work.

I really like your work. After following the instructions to set up the environment and running the training script, I got the error below:

```
(hawp) root@ubuntu1:~/home/deng/projects/wire_frame/hawp# CUDA_VISIBLE_DEVICES=1 python scripts/train.py --config-file config-files/hawp.yaml
Traceback (most recent call last):
  File "scripts/train.py", line 8, in <module>
    from parsing.detector import WireframeDetector
  File "/root/home/deng/projects/wire_frame/hawp/parsing/detector.py", line 4, in <module>
    from parsing.encoder.hafm import HAFMencoder
  File "/root/home/deng/projects/wire_frame/hawp/parsing/encoder/__init__.py", line 1, in <module>
    from .hafm import HAFMencoder
  File "/root/home/deng/projects/wire_frame/hawp/parsing/encoder/hafm.py", line 2, in <module>
    from parsing import _C
ImportError: /root/home/deng/projects/wire_frame/hawp/parsing/_C.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _Z13lsencode_cudaRKN2at6TensorEiiiii
```

My system is:

```
(hawp) root@ubuntu1:~/home/deng/projects/wire_frame/hawp# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.4 LTS
Release:        18.04
Codename:       bionic
```

My CUDA driver and toolkit version: NVIDIA-SMI 440.64.00, Driver Version 440.64.00, CUDA Version 10.2.

Do you have any idea what might be causing this? I would really appreciate your help.

hellodrx commented 3 years ago

It might be related to a mismatch between the system CUDA version and the Anaconda cudatoolkit, but I am not sure. The whole compilation finished smoothly, without a single error.
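
For reference, one way to pin down where the symbol goes missing (a diagnostic sketch, not part of the original report; it assumes binutils' `nm` and `c++filt` are available, as on a stock Ubuntu 18.04, and uses the `.so` path from the traceback):

```python
# Demangle the missing symbol and inspect the compiled extension's dynamic
# symbol table to see whether the CUDA entry point was ever linked in.
import subprocess

so_path = "parsing/_C.cpython-36m-x86_64-linux-gnu.so"  # path from the ImportError

# Demangle the symbol reported by the ImportError.
demangled = subprocess.check_output(
    ["c++filt", "_Z13lsencode_cudaRKN2at6TensorEiiiii"],
    universal_newlines=True,
).strip()
print(demangled)  # lsencode_cuda(at::Tensor const&, int, int, int, int, int)

# List dynamic symbols; a 'U' (undefined) entry for lsencode_cuda with no
# defining 'T' entry means the CUDA source implementing it was never built
# into the extension.
symbols = subprocess.check_output(["nm", "-D", so_path], universal_newlines=True)
print("\n".join(line for line in symbols.splitlines() if "lsencode" in line))
```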

cherubicXN commented 3 years ago

Hi, did you check your cudatoolkit version?

hellodrx commented 3 years ago

> Hi, did you check your cudatoolkit version?

The cudatoolkit version in the Anaconda env is 10.0.130.

```
(hawp) root@ubuntu1:~/home/deng/projects/wire_frame/hawp# conda list
Name                Version     Build    Channel
_libgcc_mutex       0.1         main     defaults
blas                1.0         mkl      https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
certifi             2016.2.28   py36_0   https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
cudatoolkit         10.0.130    0        defaults
cycler              0.10.0      pypi_0   pypi
cython              0.29.21     pypi_0   pypi
decorator           4.4.2       pypi_0   pypi
freetype            2.5.5       2        https:
```

cherubicXN commented 3 years ago

But your driver reports CUDA 10.2 (`NVIDIA-SMI 440.64.00, Driver Version: 440.64.00, CUDA Version: 10.2`).

hellodrx commented 3 years ago

So this might be the issue. I followed the instruction `conda install pytorch torchvision cudatoolkit=10.0 -c pytorch`, and Anaconda picked the cudatoolkit version for me -.-

cherubicXN commented 3 years ago

> So this might be the issue. I followed the instruction `conda install pytorch torchvision cudatoolkit=10.0 -c pytorch`, and Anaconda picked the cudatoolkit version for me -.-

That's because my CUDA version is 10.0 :)

hellodrx commented 3 years ago

> So this might be the issue. I followed the instruction `conda install pytorch torchvision cudatoolkit=10.0 -c pytorch`, and Anaconda picked the cudatoolkit version for me -.-
>
> That's because my CUDA version is 10.0 :)

I deleted the whole environment and reinstalled pytorch, torchvision, cudatoolkit, and all the pip packages. The error still exists. Now my cudatoolkit version is `cudatoolkit 10.2.89`.
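
For a sanity check of the reinstalled environment (a sketch, not from the original thread), one can ask PyTorch which CUDA runtime it was built against and whether the 440.64 driver can actually be used from it:

```python
import torch

print(torch.__version__)           # installed PyTorch build
print(torch.version.cuda)          # CUDA runtime PyTorch was compiled against, e.g. '10.2'
print(torch.cuda.is_available())   # True if the driver and runtime work together

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. 'Tesla T4'
```

Note that the conda `cudatoolkit` package only ships the CUDA runtime libraries; it does not include the `nvcc` compiler that is needed to build extensions such as `parsing._C`.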

hellodrx commented 3 years ago

It is very strange that I did not see any error during compilation.

```
(hawp) root@ubuntu1:~/home/deng/projects/wire_frame/hawp# python setup.py build_ext --inplace
running build_ext
building 'parsing._C' extension
creating /root/home/deng/projects/wire_frame/hawp/build
creating /root/home/deng/projects/wire_frame/hawp/build/temp.linux-x86_64-3.6
creating /root/home/deng/projects/wire_frame/hawp/build/temp.linux-x86_64-3.6/root
creating /root/home/deng/projects/wire_frame/hawp/build/temp.linux-x86_64-3.6/root/home
creating /root/home/deng/projects/wire_frame/hawp/build/temp.linux-x86_64-3.6/root/home/deng
creating /root/home/deng/projects/wire_frame/hawp/build/temp.linux-x86_64-3.6/root/home/deng/projects
creating /root/home/deng/projects/wire_frame/hawp/build/temp.linux-x86_64-3.6/root/home/deng/projects/wire_frame
creating /root/home/deng/projects/wire_frame/hawp/build/temp.linux-x86_64-3.6/root/home/deng/projects/wire_frame/hawp
creating /root/home/deng/projects/wire_frame/hawp/build/temp.linux-x86_64-3.6/root/home/deng/projects/wire_frame/hawp/parsing
creating /root/home/deng/projects/wire_frame/hawp/build/temp.linux-x86_64-3.6/root/home/deng/projects/wire_frame/hawp/parsing/csrc
Emitting ninja build file /root/home/deng/projects/wire_frame/hawp/build/temp.linux-x86_64-3.6/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] c++ -MMD -MF /root/home/deng/projects/wire_frame/hawp/build/temp.linux-x86_64-3.6/root/home/deng/projects/wire_frame/hawp/parsing/csrc/vision.o.d -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/root/home/deng/projects/wire_frame/hawp/parsing/csrc -I/root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include -I/root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include/TH -I/root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include/THC -I/root/anaconda3/envs/hawp/include/python3.6m -c -c /root/home/deng/projects/wire_frame/hawp/parsing/csrc/vision.cpp -o /root/home/deng/projects/wire_frame/hawp/build/temp.linux-x86_64-3.6/root/home/deng/projects/wire_frame/hawp/parsing/csrc/vision.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include/ATen/Parallel.h:149:0,
                 from /root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include/torch/csrc/api/include/torch/utils.h:3,
                 from /root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include/torch/csrc/api/include/torch/nn/cloneable.h:5,
                 from /root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include/torch/csrc/api/include/torch/nn.h:3,
                 from /root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include/torch/csrc/api/include/torch/all.h:12,
                 from /root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include/torch/extension.h:4,
                 from /root/home/deng/projects/wire_frame/hawp/parsing/csrc/cuda/vision.h:2,
                 from /root/home/deng/projects/wire_frame/hawp/parsing/csrc/linesegment.h:2,
                 from /root/home/deng/projects/wire_frame/hawp/parsing/csrc/vision.cpp:1:
/root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include/ATen/ParallelOpenMP.h:84:0: warning: ignoring #pragma omp parallel [-Wunknown-pragmas]
 #pragma omp parallel for if ((end - begin) >= grain_size)
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/parsing
g++ -pthread -shared -L/root/anaconda3/envs/hawp/lib -Wl,-rpath=/root/anaconda3/envs/hawp/lib,--no-as-needed /root/home/deng/projects/wire_frame/hawp/build/temp.linux-x86_64-3.6/root/home/deng/projects/wire_frame/hawp/parsing/csrc/vision.o -L/root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/lib -L/root/anaconda3/envs/hawp/lib -lc10 -ltorch -ltorch_cpu -ltorch_python -lpython3.6m -o build/lib.linux-x86_64-3.6/parsing/_C.cpython-36m-x86_64-linux-gnu.so
copying build/lib.linux-x86_64-3.6/parsing/_C.cpython-36m-x86_64-linux-gnu.so -> parsing
```

```
(hawp) root@ubuntu1:~/home/deng/projects/wire_frame/hawp# CUDA_VISIBLE_DEVICES=1 python scripts/train.py --config-file config-files/hawp.yaml
Traceback (most recent call last):
  File "scripts/train.py", line 8, in <module>
    from parsing.detector import WireframeDetector
  File "/root/home/deng/projects/wire_frame/hawp/parsing/detector.py", line 4, in <module>
    from parsing.encoder.hafm import HAFMencoder
  File "/root/home/deng/projects/wire_frame/hawp/parsing/encoder/__init__.py", line 1, in <module>
    from .hafm import HAFMencoder
  File "/root/home/deng/projects/wire_frame/hawp/parsing/encoder/hafm.py", line 2, in <module>
    from parsing import _C
ImportError: /root/home/deng/projects/wire_frame/hawp/parsing/_C.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _Z13lsencode_cudaRKN2at6TensorEiiiii
```

But the same error exists.

cherubicXN commented 3 years ago

Could you try directly importing the compiled `_C` module from its directory, like this:

```python
import torch
import _C

print(dir(_C))
```

hellodrx commented 3 years ago

Sorry for the late reply.

Unfortunately, I cannot import the `_C` module:

```
import torch
import _C
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: /root/home/deng/projects/wire_frame/hawp/parsing/_C.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _Z13lsencode_cudaRKN2at6TensorEiiiii
```

cherubicXN commented 3 years ago

Did you try writing a small CUDA function in the current environment? This issue is usually caused by an incompatible environment.

hellodrx commented 3 years ago

What do you mean by a small CUDA function? Something like a minimal CUDA example using Numba or PyTorch in the current env?

cherubicXN commented 3 years ago

Yes. Refer to https://pytorch.org/tutorials/advanced/cpp_extension.html
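
In that spirit, a minimal smoke test (a sketch, assuming `nvcc` is on the PATH of the `hawp` env; the kernel and the `add_one` function are made up for illustration) is to JIT-compile a tiny CUDA kernel through PyTorch's C++ extension machinery:

```python
import torch
from torch.utils.cpp_extension import load_inline

cuda_source = """
__global__ void add_one_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

torch::Tensor add_one(torch::Tensor x) {
    auto y = x.contiguous().clone();
    int n = static_cast<int>(y.numel());
    add_one_kernel<<<(n + 255) / 256, 256>>>(y.data_ptr<float>(), n);
    return y;
}
"""

cpp_source = "torch::Tensor add_one(torch::Tensor x);"

ext = load_inline(
    name="cuda_smoke_test",
    cpp_sources=cpp_source,
    cuda_sources=cuda_source,
    functions=["add_one"],
    verbose=True,  # prints the nvcc/g++ commands it runs
)

x = torch.zeros(8, device="cuda")
print(ext.add_one(x))  # expected: a tensor of ones on the GPU
```

If this fails at the `nvcc` step, compiling hawp's `parsing._C` extension will fail in the same way.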

hellodrx commented 3 years ago

I can run a PyTorch project in this new env.

```
(hawp) root@ubuntu1:~/home/deng/projects/edge# CUDA_VISIBLE_DEVICES=1 python ablation_study.py
Epoch 0/99

[21-01-31 03:07:45 PM] epoch 0, iteration 1, final_loss : 1723.3910
[21-01-31 03:07:46 PM] epoch 0, iteration 2, final_loss : 1732.1219
[21-01-31 03:07:47 PM] epoch 0, iteration 3, final_loss : 1606.6276
[21-01-31 03:07:48 PM] epoch 0, iteration 4, final_loss : 1590.3911
[21-01-31 03:07:49 PM] epoch 0, iteration 5, final_loss : 1508.3062
[21-01-31 03:07:51 PM] epoch 0, iteration 6, final_loss : 1462.1621
[21-01-31 03:07:52 PM] epoch 0, iteration 7, final_loss : 1401.9430
^CTraceback (most recent call last):
  File "ablation_study.py", line 514, in <module>
    train_model(edge_extractor, optimizer, step_lr_scheduler)
  File "ablation_study.py", line 375, in train_model
    optimizer.step()
  File "/root/anaconda3/envs/hawp/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/hawp/lib/python3.7/site-packages/torch/optim/adam.py", line 119, in step
    group['eps']
  File "/root/anaconda3/envs/hawp/lib/python3.7/site-packages/torch/optim/functional.py", line 83, in adam
    grad = grad.add(param, alpha=weight_decay)
KeyboardInterrupt
```

cherubicXN commented 3 years ago

What is the output of `nvcc --version`?

hellodrx commented 3 years ago

```
(hawp) root@ubuntu1:~/home/deng/projects/wire_frame/hawp# nvcc --version

Command 'nvcc' not found, but can be installed with:

apt install nvidia-cuda-toolkit
```

But `nvidia-smi` works fine on the system:

```
(hawp) root@ubuntu1:~/home/deng/projects/wire_frame/hawp# nvidia-smi
Sun Jan 31 15:15:24 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   66C    P0    68W /  70W |  14624MiB / 15109MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   27C    P8     9W /  70W |     11MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     37768      C   python                                   14613MiB   |
+-----------------------------------------------------------------------------+
```

cherubicXN commented 3 years ago

Maybe you need to install the full CUDA toolkit (including `nvcc`) on your device.

cherubicXN commented 3 years ago

```
[1/1] c++ -MMD -MF /root/home/deng/projects/wire_frame/hawp/build/temp.linux-x86_64-3.6/root/home/deng/projects/wire_frame/hawp/parsing/csrc/vision.o.d -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/root/home/deng/projects/wire_frame/hawp/parsing/csrc -I/root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include -I/root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include/TH -I/root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include/THC -I/root/anaconda3/envs/hawp/include/python3.6m -c -c /root/home/deng/projects/wire_frame/hawp/parsing/csrc/vision.cpp -o /root/home/deng/projects/wire_frame/hawp/build/temp.linux-x86_64-3.6/root/home/deng/projects/wire_frame/hawp/parsing/csrc/vision.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
```

I carefully read your compile log and found that nvcc is never called, so the CUDA sources that define `lsencode_cuda` are never compiled into the extension.
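
A quick way to check what the build helpers can see (a sketch, not from the thread; `CUDA_HOME` is a variable exposed by `torch.utils.cpp_extension`):

```python
import shutil
from torch.utils.cpp_extension import CUDA_HOME

print("CUDA_HOME:", CUDA_HOME)                 # e.g. '/usr/local/cuda' when the full toolkit is installed
print("nvcc on PATH:", shutil.which("nvcc"))   # None here, consistent with "Command 'nvcc' not found"
```

If `CUDA_HOME` is `None` and `nvcc` is not on the PATH, the extension is presumably built from the C++ sources only, which matches the log above.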

hellodrx commented 3 years ago

It is really late now, sorry for taking up so much of your time. I will find another machine to check the environment.

Yes, I will contact the administrator to see if I can install the driver.

Thanks again for your reply; I'm grateful for the support!

hellodrx commented 3 years ago

> ```
> [1/1] c++ -MMD -MF /root/home/deng/projects/wire_frame/hawp/build/temp.linux-x86_64-3.6/root/home/deng/projects/wire_frame/hawp/parsing/csrc/vision.o.d -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/root/home/deng/projects/wire_frame/hawp/parsing/csrc -I/root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include -I/root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include/TH -I/root/anaconda3/envs/hawp/lib/python3.6/site-packages/torch/include/THC -I/root/anaconda3/envs/hawp/include/python3.6m -c -c /root/home/deng/projects/wire_frame/hawp/parsing/csrc/vision.cpp -o /root/home/deng/projects/wire_frame/hawp/build/temp.linux-x86_64-3.6/root/home/deng/projects/wire_frame/hawp/parsing/csrc/vision.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
> ```
>
> I carefully read your compile log and found that nvcc is never called, so the CUDA sources that define `lsencode_cuda` are never compiled into the extension.

It seems the issue has been located, but I will have to contact the machine's administrator. Thank you for your help.

cherubicXN commented 3 years ago

Okay, good luck. Please let me know the result after you install the full CUDA toolkit.

hellodrx commented 3 years ago

> Okay, good luck. Please let me know the result after you install the full CUDA toolkit.

Sure, sweet dreams!

hellodrx commented 3 years ago

> Okay, good luck. Please let me know the result after you install the full CUDA toolkit.

Problem solved! It was caused by an incomplete installation of the CUDA toolkit: I checked `/usr/local/` on the server and found no `cuda` folder, which is why `nvcc` could not be located.

I found another machine where CUDA was installed from the official `.sh` installer, and the program works now.

Thank you very much for your kind help.