Closed rui--zhang closed 2 years ago
Hi,
the error message tells me that cryolo does not see a GPU at all:
However this machine only has: ['/cpu:0'].
It only sees the CPU.
Did you install cryolo with CUDA 11 support? If not, please try that:
https://cryolo.readthedocs.io/en/stable/installation.html#with-cuda-11
Best, Thorsten
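A quick way to check what the error message reports is to ask TensorFlow directly which devices it can see from inside the activated cryolo environment. This is only a minimal sketch: the helper name `visible_devices` is made up for illustration, and it assumes that `tensorflow.python.client.device_lib` is importable in the environment (it returns `None` otherwise).

```python
"""List the devices TensorFlow can see inside the active environment."""


def visible_devices():
    """Return TensorFlow's device names, or None if TensorFlow is not installed."""
    try:
        # device_lib ships with TensorFlow; the import fails cleanly if it is absent.
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return [device.name for device in device_lib.list_local_devices()]


if __name__ == "__main__":
    devices = visible_devices()
    if devices is None:
        print("TensorFlow is not installed in this environment.")
    else:
        # GPU entries contain "GPU" in their name; a CPU-only list means
        # TensorFlow did not register any GPU.
        gpus = [name for name in devices if "GPU" in name.upper()]
        print(f"All devices: {devices}")
        print(f"GPU devices: {gpus if gpus else 'none (CPU only)'}")
```

If the GPU list is empty although the machine has GPUs, the TensorFlow start-up log is usually the next place to look.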
30.04.2022 23:27:47 Rui Zhang wrote:
Hi I have eight RTX 3090 cards in my workstation, but it seems cryolo can only recognize 1 GPU. Please advise how to fix this issue.
The command I ran: '/home/zhangrui/.conda/envs/cryolo/bin/python3.8' -u '/home/zhangrui/.conda/envs/cryolo/bin/cryolo_gui.py' --ignore-gooey train -c 'config_cryolo.json' -w '5' -g 0 1 2 3 4 5 6 7 -nc '-1' --gpu_fraction '1.0' -e '10' -lft '2' --seed '10'
(If I change -g 0 1 2 3 4 5 6 7 to -g 0, the job runs fine.)
Thanks! Rui
Traceback (most recent call last):
  File "/home/zhangrui/.conda/envs/cryolo/bin/cryolo_gui.py", line 8, in <module>
    sys.exit(_main_())
  File "/home/zhangrui/.conda/envs/cryolo/lib/python3.8/site-packages/cryolo/cryolo_main.py", line 455, in _main_
    Gooey(
  File "/home/zhangrui/.conda/envs/cryolo/lib/python3.8/site-packages/gooey/python_bindings/gooey_decorator.py", line 134, in <lambda>
    return lambda *args, **kwargs: func(*args, **kwargs)
  File "/home/zhangrui/.conda/envs/cryolo/lib/python3.8/site-packages/cryolo/cryolo_main.py", line 424, in main
    train.main(args)
  File "/home/zhangrui/.conda/envs/cryolo/lib/python3.8/site-packages/cryolo/train.py", line 753, in main
    parallel_model = multi_gpu_model(yolo.model, gpus=num_gpus, cpu_merge=False)
  File "/home/zhangrui/.conda/envs/cryolo/lib/python3.8/site-packages/keras/utils/multi_gpu_utils.py", line 178, in multi_gpu_model
    raise ValueError(
ValueError: To call `multi_gpu_model` with `gpus=8`, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3', '/gpu:4', '/gpu:5', '/gpu:6', '/gpu:7']. However this machine only has: ['/cpu:0']. Try reducing `gpus`.
Hi Thorsten,
Thank you so much for the prompt reply! Yes I did install cryolo with cuda 11 support. It seems to be running on GPU 0 (see the result from "nvidia-smi" below):
Actually, I can see another error message "Could not load dynamic library 'libcublas.so.11'", but the job could continue to run.
#####################################################
/home/zhangrui/.conda/envs/cryolo/bin/cryolo_gui.py train -c /data/DMT/config_cryolo.json -w 5 -g 0 -nc 80 --gpu_fraction 1.0 -e 10 -lft 2 --seed 10
#####################################################
2022-04-30 16:19:40.836541: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
Using TensorFlow backend.
2022-04-30 16:19:41.520615: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz
2022-04-30 16:19:41.528025: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5593c341ba70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-04-30 16:19:41.528074: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2022-04-30 16:19:41.533094: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-04-30 16:19:41.782342: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5593c341dba0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-04-30 16:19:41.782410: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce RTX 3090, Compute Capability 8.6
2022-04-30 16:19:41.787381: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1666] Found device 0 with properties:
name: GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.695
pciBusID: 0000:1b:00.0
2022-04-30 16:19:41.787492: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-04-30 16:19:41.791541: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: /home/zhangrui/.conda/envs/cryolo/lib/python3.8/site-packages/tensorflow_core/python/../../nvidia/cublas/lib/libcublas.so.11: undefined symbol: cublasLtHSHMatmulAlgoInit, version libcublasLt.so.11; LD_LIBRARY_PATH: /usr/local/relion/lib:/home/zhangrui/bin/relion3.1/external/fltk/lib:/home/zhangrui/bin/relion3.1/external/fftw/lib:/home/zhangrui/bin/bsoft-1.7/lib::/home/zhangrui/bin/EMAN/lib:/usr/local/lib:/usr/lib:/opt/openmpi/4.0.5/lib64:/opt/pymol/lib64:/usr/local/lib:/usr/local/cuda-11.1/lib64:/usr/local/relion/lib:/home/zhangrui/bin/relion3.1/external/fltk/lib:/home/zhangrui/bin/relion3.1/external/fftw/lib:/home/zhangrui/bin/bsoft-1.7/lib::/home/zhangrui/bin/EMAN/lib:/usr/local/lib:/usr/lib:/opt/pymol/lib64:/usr/local/lib:/opt/pymol/lib64:/usr/local/lib:
2022-04-30 16:19:41.827655: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-04-30 16:19:41.827902: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-04-30 16:19:41.830802: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2022-04-30 16:19:41.831667: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-04-30 16:19:41.831780: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-04-30 16:19:41.831791: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1689] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...
2022-04-30 16:19:41.831810: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-04-30 16:19:41.831817: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] 0
2022-04-30 16:19:41.831823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1225] 0: N
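The two lines that matter in the log above are the "Could not load dynamic library" warning and the "Skipping registering GPU devices" message: once a single CUDA library fails to load, TensorFlow registers no GPU and the job silently continues on the CPU. As a rough sketch (the function name `diagnose` is hypothetical, and the patterns are simply taken from the messages in this thread), a saved log can be scanned for both:

```python
import re

# Patterns copied from the TensorFlow messages shown in this thread.
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")
GPU_SKIPPED = "Skipping registering GPU devices"


def diagnose(log_text):
    """Return (libraries that failed to load, whether GPU registration was skipped)."""
    missing = MISSING_LIB.findall(log_text)
    return missing, GPU_SKIPPED in log_text


if __name__ == "__main__":
    # Shortened sample; the dlerror details are elided here on purpose.
    sample = (
        "W tensorflow/stream_executor/platform/default/dso_loader.cc:60] "
        "Could not load dynamic library 'libcublas.so.11'; dlerror: ...\n"
        "W tensorflow/core/common_runtime/gpu/gpu_device.cc:1689] "
        "Cannot dlopen some GPU libraries. Skipping registering GPU devices...\n"
    )
    missing, skipped = diagnose(sample)
    print("Failed libraries:", missing)
    print("GPU registration skipped:", skipped)
```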
Hi Rui,
the memory usage is far too low. I don't think it is using the GPU. How long does the prediction take for your number of micrographs?
Can you run
conda list >> packages.txt
and send me the text file?
Best, Thorsten
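The "memory usage is far too low" check can also be scripted instead of read off the nvidia-smi table. The sketch below is an illustration only: the helper names are made up, and it relies on nvidia-smi's `--query-gpu` and `--format=csv,noheader,nounits` options. The sample values mirror GPU 0 and GPU 1 from the nvidia-smi output posted in this thread.

```python
import subprocess


def parse_memory(csv_text):
    """Parse 'index, memory.used, memory.total' rows into (index, used_MiB, total_MiB)."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage in MiB (requires the NVIDIA driver)."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory(output)


if __name__ == "__main__":
    # Sample rows in the format nvidia-smi prints for the query above.
    for index, used, total in parse_memory("0, 261, 24268\n1, 5, 24268\n"):
        print(f"GPU {index}: {used} / {total} MiB ({100 * used / total:.1f}%)")
```

A training job that really runs on a GPU would normally occupy a large share of its memory, so a value of a few hundred MiB on a 24 GiB card points to the CPU doing the work.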
30.04.2022 23:59:17 Rui Zhang wrote:
Hi Thorsten,
Thank you so much for the prompt reply! Yes I did install cryolo with cuda 11 support. It seems to be running on GPU 0 (see the result below):
$ nvidia-smi | head -40
Sat Apr 30 16:56:44 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:1B:00.0 Off |                  N/A |
| 30%   37C    P8    19W / 350W |    261MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 3090    Off  | 00000000:1C:00.0 Off |                  N/A |
| 30%   33C    P8    17W / 350W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 3090    Off  | 00000000:1D:00.0 Off |                  N/A |
| 30%   36C    P8    18W / 350W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 3090    Off  | 00000000:1E:00.0 Off |                  N/A |
| 30%   36C    P8    28W / 350W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 3090    Off  | 00000000:B2:00.0 Off |                  N/A |
| 30%   35C    P8    28W / 350W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 3090    Off  | 00000000:B3:00.0 Off |                  N/A |
| 30%   35C    P8    22W / 350W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  GeForce RTX 3090    Off  | 00000000:B4:00.0 Off |                  N/A |
| 30%   35C    P8    18W / 350W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  GeForce RTX 3090    Off  | 00000000:B5:00.0 Off |                  N/A |
| 30%   34C    P8    27W / 350W |      5MiB / 24268MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
Now I think your path variable might be the problem. While having the cryolo environment activated, can you run
echo $PATH
and send me the output?
Best Thorsten
It took 2-3 hours for the training step on 15 micrographs. The prediction step is pretty fast. Here is the result of running conda list >> packages.txt (attached as packages.txt).
This is the result of running "echo $PATH":
/home/zhangrui/.conda/envs/cryolo/bin:/opt/miniconda3/condabin:/opt/miniconda3/bin:/home/zhangrui/.local/bin:/home/zhangrui/bin:/home/zhangrui/bin/pyem:/data/colabfold_batch/bin:/home/zhangrui/bin/cryosparc2/cryosparc_master/bin:/home/zhangrui/bin/cryosparc2/cryosparc_master/bin:/usr/local/relion/bin:/usr/local/phenix-1.19.2-4158/build/bin:/home/zhangrui/bin/bsoft-1.7/bin:/home/zhangrui/bin/EMAN/bin:/data/bin2/ccpem-1.5.0/bin:/data/ccp4-7.1/etc:/data/ccp4-7.1/bin:/usr/local/cuda-11.1/bin:/opt/openmpi/4.0.5/bin:/opt/bin:/opt/pymol/bin:/opt/cistem/1.0.0:/opt/frealign/9.11/bin:/usr/local/IMOD/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/zhangrui/bin/LIBG/bin:/home/zhangrui/bin/spiderweb-19.08/spider/bin
And this is the result of running "echo $LD_LIBRARY_PATH":
/usr/local/relion/lib:/home/zhangrui/bin/relion3.1/external/fltk/lib:/home/zhangrui/bin/relion3.1/external/fftw/lib:/home/zhangrui/bin/bsoft-1.7/lib::/home/zhangrui/bin/EMAN/lib:/usr/local/lib:/usr/lib:/opt/openmpi/4.0.5/lib64:/opt/pymol/lib64:/usr/local/lib:/usr/local/cuda-11.1/lib64:/opt/pymol/lib64:/usr/local/lib:
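With this many entries it is hard to see by eye which directory could shadow the CUDA libraries that ship inside the conda environment. The thread does not establish which entry caused the conflict; the `undefined symbol ... version libcublasLt.so.11` error would be consistent with a different cuBLAS build being picked up from one of these directories, but that is an assumption. The following sketch (with a hypothetical helper, `find_library`) lists every search-path directory that contains a matching library, in search order:

```python
import glob
import os


def find_library(ld_library_path, pattern):
    """Return (directory, matching file names) per search-path entry, in search order."""
    hits = []
    seen = set()
    for entry in ld_library_path.split(":"):
        # Skip empty entries (the loader treats them as the current directory)
        # and entries that were already checked.
        if not entry or entry in seen:
            continue
        seen.add(entry)
        matches = sorted(glob.glob(os.path.join(entry, pattern)))
        if matches:
            hits.append((entry, [os.path.basename(m) for m in matches]))
    return hits


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    for directory, files in find_library(path, "libcublas*.so*"):
        print(directory, files)
```

Any directory printed here is searched before the loader's default locations, so a cuBLAS copy found in it is a candidate for the mismatch.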
Your PATH looks good. We wonder if your LD_LIBRARY_PATH interferes somehow with crYOLO. Could you try to run it like this:
LD_LIBRARY_PATH='' cryolo_predict.py [... your arguments here]
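The `LD_LIBRARY_PATH=''` prefix changes the variable only for that one command and leaves the rest of the shell session untouched. When crYOLO is launched from a script rather than typed in a terminal, the same effect can be approximated as below. This is a sketch, not part of crYOLO: the function name is made up, and the probe command is a harmless stand-in for the real crYOLO call.

```python
import os
import subprocess
import sys


def run_without_ld_library_path(command):
    """Run command with LD_LIBRARY_PATH set to an empty string for the child only."""
    env = dict(os.environ)
    env["LD_LIBRARY_PATH"] = ""  # same effect as the LD_LIBRARY_PATH='' prefix
    return subprocess.run(command, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in command that prints what the child process sees;
    # replace it with the crYOLO call and its arguments.
    probe = [sys.executable, "-c",
             "import os; print(repr(os.environ.get('LD_LIBRARY_PATH')))"]
    print(run_without_ld_library_path(probe).stdout.strip())
```

Because the output is captured here, a long-running job would show nothing until it finishes; dropping `capture_output=True` lets the child print to the terminal directly.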
> It took 2-3 hours for the training step on 15 micrographs.
This should rather take 10-15 minutes. That is another indication that cryolo is actually running on the CPU.
It works now! The 8 GPUs are at full speed. Thank you so much for the help 😊
Glad to hear that it works! So LD_LIBRARY_PATH='' did the trick?
Yes, it did the trick!