Cannot dlopen some GPU libraries

minwang-ai commented 3 years ago

[Screenshot: no_lib_error]

Hi all, I followed the README for installation, but I get library errors when I train models via train.py. I have checked the tensorflow-gpu, CUDA, and cuDNN versions. Could you help me figure it out?

danielS91 commented 3 years ago

We are facing the same errors/warnings. They tell you that your version of TensorFlow is not compatible with the CUDA / cuDNN version installed by PyTorch. However, we do not use any CUDA / cuDNN functionality from TensorFlow; TensorFlow is only used to speed up the confusion matrix calculation. Therefore, you can simply ignore these warnings/errors.

However, if you want to get rid of these errors/warnings, you can switch to the CPU version of TensorFlow by replacing tensorflow-gpu=1.15.0 with tensorflow-cpu=1.15.0 in the environment file, or by using pip if the environment has already been created. The CPU version works well.
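To see which libraries the loader actually fails on, a small check like the following may help. It is only a sketch: it asks the system loader for each name via ctypes, which searches the same places (for example LD_LIBRARY_PATH) as any other dlopen call in the process. Only libcufft.so.10.0 is quoted in this thread; the other names in CANDIDATE_LIBS are assumptions for illustration, and probe_libraries is a hypothetical helper, not part of ESANet.

```python
import ctypes

# Only libcufft.so.10.0 is quoted in this thread; the remaining names are
# assumptions for illustration and may differ on your system.
CANDIDATE_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcudnn.so.7",
]


def probe_libraries(names):
    """Return {name: None if loadable, else the loader's error message}."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # asks the system loader (dlopen on Linux)
            results[name] = None
        except OSError as exc:
            results[name] = str(exc)
    return results


if __name__ == "__main__":
    for name, error in probe_libraries(CANDIDATE_LIBS).items():
        status = "found" if error is None else f"missing ({error})"
        print(f"{name}: {status}")
```

A library reported as missing here only means the loader cannot find that exact file name in the current environment; as noted above, this does not affect training, because the CUDA / cuDNN parts of TensorFlow are not used.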

minwang-ai commented 3 years ago

Hi Daniel, thank you for your reply! I thought the code could not run with these errors, but I have now noticed that the output file is updated. The tf errors disappear after loading cuda 10.0.130 and cudnn 10.1_v7.5.

Can we also ignore the other warning? E.g., UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule.

mona0809 commented 3 years ago

Yes, since we always pass the epoch explicitly as a parameter (https://github.com/TUI-NICR/ESANet/blob/main/train.py#L264), the warning can be ignored here.
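The reasoning can be illustrated with a toy model. This is a sketch only: ToyStepScheduler and collect_lrs are hypothetical names, the class is a pure-Python stand-in and not the PyTorch API or the ESANet code, and it assumes a scheduler whose rate depends only on an epoch counter. When step() is called without an argument at the start of each epoch, the counter runs one ahead and the first value is skipped; when the epoch is passed explicitly, the counter always matches the current epoch.

```python
class ToyStepScheduler:
    """Toy stand-in for a learning rate scheduler; NOT the PyTorch API."""

    def __init__(self, base_lr, gamma):
        self.base_lr = base_lr
        self.gamma = gamma
        self.last_epoch = 0  # the schedule starts at its first value

    def step(self, epoch=None):
        # Without an argument the internal counter advances by one;
        # with an explicit epoch the counter is simply set to that epoch.
        self.last_epoch = self.last_epoch + 1 if epoch is None else epoch

    def lr(self):
        return self.base_lr * self.gamma ** self.last_epoch


def collect_lrs(num_epochs, explicit_epoch):
    """Call step() at the start of every epoch and record the rate in use."""
    scheduler = ToyStepScheduler(base_lr=1.0, gamma=0.5)
    used = []
    for epoch in range(num_epochs):
        scheduler.step(epoch if explicit_epoch else None)
        used.append(scheduler.lr())  # rate the training step would see
    return used


implicit = collect_lrs(3, explicit_epoch=False)  # [0.5, 0.25, 0.125]
explicit = collect_lrs(3, explicit_epoch=True)   # [1.0, 0.5, 0.25]
```

In the toy model, the implicit variant never uses the initial rate of 1.0, while the explicit variant follows the schedule from its first value.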

minwang-ai commented 3 years ago

The tf errors disappear after loading cuda 10.0.130 and cudnn 10.1_v7.5

The mIoU decreases in this case, so I changed them back to CUDA 10.1 and cuDNN 7.6, respectively.

zhouqunbing commented 1 year ago

I replaced tensorflow-gpu=1.15.0 with tensorflow-cpu=1.15.0, but the same error occurred when the model was in the test phase: Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory...

TUI-NICR / ESANet

Cannot dlopen some GPU libraries #24