MPI-Dortmund / cryolo

cryolo documentation

cryolo can only see 1 GPU #10

Closed rui--zhang closed 2 years ago

rui--zhang commented 2 years ago

Hi, I have eight RTX 3090 cards in my workstation, but it seems cryolo can only recognize one GPU. Please advise how to fix this issue.

The command I ran: '/home/zhangrui/.conda/envs/cryolo/bin/python3.8' -u '/home/zhangrui/.conda/envs/cryolo/bin/cryolo_gui.py' --ignore-gooey train -c 'config_cryolo.json' -w '5' -g 0 1 2 3 4 5 6 7 -nc '-1' --gpu_fraction '1.0' -e '10' -lft '2' --seed '10'

(If I change -g 0 1 2 3 4 5 6 7 to -g 0, the job runs fine.)

Thanks! Rui


Traceback (most recent call last):
  File "/home/zhangrui/.conda/envs/cryolo/bin/cryolo_gui.py", line 8, in <module>
    sys.exit(main())
  File "/home/zhangrui/.conda/envs/cryolo/lib/python3.8/site-packages/cryolo/cryolo_main.py", line 455, in main
    Gooey(
  File "/home/zhangrui/.conda/envs/cryolo/lib/python3.8/site-packages/gooey/python_bindings/gooey_decorator.py", line 134, in <lambda>
    return lambda *args, **kwargs: func(*args, **kwargs)
  File "/home/zhangrui/.conda/envs/cryolo/lib/python3.8/site-packages/cryolo/cryolo_main.py", line 424, in main
    train.main(args)
  File "/home/zhangrui/.conda/envs/cryolo/lib/python3.8/site-packages/cryolo/train.py", line 753, in main
    parallel_model = multi_gpu_model(yolo.model, gpus=num_gpus, cpu_merge=False)
  File "/home/zhangrui/.conda/envs/cryolo/lib/python3.8/site-packages/keras/utils/multi_gpu_utils.py", line 178, in multi_gpu_model
    raise ValueError(
ValueError: To call multi_gpu_model with gpus=8, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3', '/gpu:4', '/gpu:5', '/gpu:6', '/gpu:7']. However this machine only has: ['/cpu:0']. Try reducing gpus.

thorstenwagner commented 2 years ago

Hi,

The error message tells me that cryolo does not see any GPU at all:

However this machine only has: ['/cpu:0'].

It only sees the CPU.

Did you install cryolo with CUDA 11 support? If not, please try that:

https://cryolo.readthedocs.io/en/stable/installation.html#with-cuda-11

Best, Thorsten
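One way to cross-check the device list in the error is to ask TensorFlow directly which devices it has registered. The following is a minimal sketch, not part of cryolo: it assumes TensorFlow is importable in the active conda environment and relies on `device_lib.list_local_devices()`; the helper names `list_devices` and `count_gpus` are made up for this illustration.

```python
def list_devices():
    """Return (name, device_type) pairs reported by TensorFlow.

    Returns None when TensorFlow cannot be imported in this environment.
    """
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return [(d.name, d.device_type) for d in device_lib.list_local_devices()]


def count_gpus(devices):
    """Count entries whose device type is exactly "GPU" (XLA_GPU is ignored)."""
    return sum(1 for _name, device_type in devices if device_type == "GPU")


if __name__ == "__main__":
    devices = list_devices()
    if devices is None:
        print("TensorFlow is not importable here")
    else:
        for name, device_type in devices:
            print(device_type, name)
        print("GPUs registered:", count_gpus(devices))
```

If the count is 0 on a machine where nvidia-smi lists GPUs, the problem is likely in how TensorFlow loads its CUDA libraries rather than in the hardware or driver.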


rui--zhang commented 2 years ago

Hi Thorsten,

Thank you so much for the prompt reply! Yes I did install cryolo with cuda 11 support. It seems to be running on GPU 0 (see the result from "nvidia-smi" below):

[Screenshot: nvidia-smi output, 2022-04-30 5:01:56 PM]

rui--zhang commented 2 years ago

Actually, I can see another error message "Could not load dynamic library 'libcublas.so.11'", but the job could continue to run.

#####################################################
/home/zhangrui/.conda/envs/cryolo/bin/cryolo_gui.py train -c /data/DMT/config_cryolo.json -w 5 -g 0 -nc 80 --gpu_fraction 1.0 -e 10 -lft 2 --seed 10
#####################################################
2022-04-30 16:19:40.836541: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
Using TensorFlow backend.
2022-04-30 16:19:41.520615: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz
2022-04-30 16:19:41.528025: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5593c341ba70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-04-30 16:19:41.528074: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2022-04-30 16:19:41.533094: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-04-30 16:19:41.782342: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5593c341dba0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2022-04-30 16:19:41.782410: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce RTX 3090, Compute Capability 8.6
2022-04-30 16:19:41.787381: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1666] Found device 0 with properties:
name: GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.695
pciBusID: 0000:1b:00.0
2022-04-30 16:19:41.787492: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-04-30 16:19:41.791541: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: /home/zhangrui/.conda/envs/cryolo/lib/python3.8/site-packages/tensorflow_core/python/../../nvidia/cublas/lib/libcublas.so.11: undefined symbol: cublasLtHSHMatmulAlgoInit, version libcublasLt.so.11; LD_LIBRARY_PATH: /usr/local/relion/lib:/home/zhangrui/bin/relion3.1/external/fltk/lib:/home/zhangrui/bin/relion3.1/external/fftw/lib:/home/zhangrui/bin/bsoft-1.7/lib::/home/zhangrui/bin/EMAN/lib:/usr/local/lib:/usr/lib:/opt/openmpi/4.0.5/lib64:/opt/pymol/lib64:/usr/local/lib:/usr/local/cuda-11.1/lib64:/usr/local/relion/lib:/home/zhangrui/bin/relion3.1/external/fltk/lib:/home/zhangrui/bin/relion3.1/external/fftw/lib:/home/zhangrui/bin/bsoft-1.7/lib::/home/zhangrui/bin/EMAN/lib:/usr/local/lib:/usr/lib:/opt/pymol/lib64:/usr/local/lib:/opt/pymol/lib64:/usr/local/lib:
2022-04-30 16:19:41.827655: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-04-30 16:19:41.827902: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-04-30 16:19:41.830802: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2022-04-30 16:19:41.831667: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-04-30 16:19:41.831780: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-04-30 16:19:41.831791: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1689] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...
2022-04-30 16:19:41.831810: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1206] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-04-30 16:19:41.831817: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212]      0
2022-04-30 16:19:41.831823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1225] 0:   N
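In a log like this, the two lines that matter are easy to miss: the dso_loader warning for one library and the later "Skipping registering GPU devices" message. As a rough sketch (the function name `summarize_tf_log` and the regular expressions are assumptions written for this thread, not a TensorFlow or cryolo API), a saved log can be scanned like this:

```python
import re

# Patterns for the dso_loader messages seen in the log above.
FAILED_LIB = re.compile(r"Could not load dynamic library '([^']+)'")
LOADED_LIB = re.compile(r"Successfully opened dynamic library\s+(\S+)")


def summarize_tf_log(log_text):
    """Report which libraries loaded, which failed, and whether GPUs were skipped."""
    return {
        "loaded": sorted(set(LOADED_LIB.findall(log_text))),
        "failed": sorted(set(FAILED_LIB.findall(log_text))),
        "gpu_skipped": "Skipping registering GPU devices" in log_text,
    }


if __name__ == "__main__":
    sample = (
        "I dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0\n"
        "W dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: ...\n"
        "W gpu_device.cc:1689] Cannot dlopen some GPU libraries. "
        "Skipping registering GPU devices...\n"
    )
    print(summarize_tf_log(sample))
```

A single entry under "failed" together with "gpu_skipped" being true matches what happened here: the job kept running, but without any registered GPU.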

thorstenwagner commented 2 years ago

Hi Rui,

The memory usage is far too low. I don't think it is using the GPU. How long does the prediction take for your number of micrographs?

Can you run

conda list >> packages.txt

and send me the text file?

Best, Thorsten

30.04.2022 23:59:17 Rui Zhang @.***>:

Hi Thorsten,

Thank you so much for the prompt reply! Yes I did install cryolo with cuda 11 support. It seems to be running on GPU 0 (see the result below):

$ nvidia-smi | head -40
Sat Apr 30 16:56:44 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:1B:00.0 Off |                  N/A |
| 30%   37C    P8    19W / 350W |    261MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 3090    Off  | 00000000:1C:00.0 Off |                  N/A |
| 30%   33C    P8    17W / 350W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 3090    Off  | 00000000:1D:00.0 Off |                  N/A |
| 30%   36C    P8    18W / 350W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 3090    Off  | 00000000:1E:00.0 Off |                  N/A |
| 30%   36C    P8    28W / 350W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 3090    Off  | 00000000:B2:00.0 Off |                  N/A |
| 30%   35C    P8    28W / 350W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 3090    Off  | 00000000:B3:00.0 Off |                  N/A |
| 30%   35C    P8    22W / 350W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  GeForce RTX 3090    Off  | 00000000:B4:00.0 Off |                  N/A |
| 30%   35C    P8    18W / 350W |      5MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  GeForce RTX 3090    Off  | 00000000:B5:00.0 Off |                  N/A |
| 30%   34C    P8    27W / 350W |      5MiB / 24268MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
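The "memory usage is far too low" check can also be done without reading the table by eye. The sketch below is only an illustration: it parses text in the CSV form that `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits` prints, and flags GPUs below a threshold. The function name `idle_gpus` and the 1000 MiB default are arbitrary assumptions, not values from cryolo.

```python
import csv
import io


def idle_gpus(csv_text, min_used_mib=1000):
    """Return indices of GPUs whose used memory (MiB) is below min_used_mib.

    csv_text holds one "index, memory.used" row per GPU, without header or units.
    """
    idle = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:  # tolerate blank lines
            continue
        index, used_mib = int(row[0]), int(row[1])
        if used_mib < min_used_mib:
            idle.append(index)
    return idle


if __name__ == "__main__":
    # Values taken from the nvidia-smi output quoted in this thread.
    sample = "0, 261\n1, 5\n2, 5\n3, 5\n4, 5\n5, 5\n6, 5\n7, 5\n"
    print(idle_gpus(sample))
```

With the numbers from this thread, every GPU falls below the threshold, including GPU 0 with its 261 MiB, which fits the suspicion that training was not running on a GPU.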


thorstenwagner commented 2 years ago

Now I think your PATH variable might be the problem. With the cryolo environment activated, can you run

echo $PATH

and send me the output?

Best Thorsten


rui--zhang commented 2 years ago

It took 2-3 hours for the training step on 15 micrographs. The prediction step is pretty fast. Here is the result of running conda list >> packages.txt (attachment: packages.txt)

rui--zhang commented 2 years ago

This is the result of running "echo $PATH":

/home/zhangrui/.conda/envs/cryolo/bin:/opt/miniconda3/condabin:/opt/miniconda3/bin:/home/zhangrui/.local/bin:/home/zhangrui/bin:/home/zhangrui/bin/pyem:/data/colabfold_batch/bin:/home/zhangrui/bin/cryosparc2/cryosparc_master/bin:/home/zhangrui/bin/cryosparc2/cryosparc_master/bin:/usr/local/relion/bin:/usr/local/phenix-1.19.2-4158/build/bin:/home/zhangrui/bin/bsoft-1.7/bin:/home/zhangrui/bin/EMAN/bin:/data/bin2/ccpem-1.5.0/bin:/data/ccp4-7.1/etc:/data/ccp4-7.1/bin:/usr/local/cuda-11.1/bin:/opt/openmpi/4.0.5/bin:/opt/bin:/opt/pymol/bin:/opt/cistem/1.0.0:/opt/frealign/9.11/bin:/usr/local/IMOD/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/zhangrui/bin/LIBG/bin:/home/zhangrui/bin/spiderweb-19.08/spider/bin

rui--zhang commented 2 years ago

And this is the result of running "echo $LD_LIBRARY_PATH":

/usr/local/relion/lib:/home/zhangrui/bin/relion3.1/external/fltk/lib:/home/zhangrui/bin/relion3.1/external/fftw/lib:/home/zhangrui/bin/bsoft-1.7/lib::/home/zhangrui/bin/EMAN/lib:/usr/local/lib:/usr/lib:/opt/openmpi/4.0.5/lib64:/opt/pymol/lib64:/usr/local/lib:/usr/local/cuda-11.1/lib64:/opt/pymol/lib64:/usr/local/lib:
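The dlerror in the log complains about an undefined symbol with version libcublasLt.so.11. One plausible reading, which the thread does not confirm, is that the libcublas.so.11 shipped inside the conda environment was paired with a different libcublasLt.so.11 picked up through LD_LIBRARY_PATH (for example from /usr/local/cuda-11.1/lib64). A small sketch for listing which LD_LIBRARY_PATH directories contain cuBLAS files, with the hypothetical helper name `find_library_copies`:

```python
import glob
import os


def find_library_copies(ld_library_path, pattern="libcublas*.so*"):
    """Map each LD_LIBRARY_PATH directory to the matching file names found in it."""
    hits = {}
    for directory in ld_library_path.split(":"):
        if not directory:  # skip empty entries such as the "::" seen above
            continue
        matches = sorted(glob.glob(os.path.join(directory, pattern)))
        if matches:
            hits[directory] = [os.path.basename(m) for m in matches]
    return hits


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    for directory, names in find_library_copies(current).items():
        print(directory, names)
```

Any directory reported here is a candidate for shadowing the libraries bundled with the environment.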

thorstenwagner commented 2 years ago

Your PATH looks good. We wonder whether your LD_LIBRARY_PATH somehow interferes with crYOLO. Could you try running it like this:

LD_LIBRARY_PATH='' cryolo_predict.py [... your arguments here]

It took 2-3 hours for the training step on 15 micrographs.

This should take more like 10-15 minutes, which is another indication that cryolo is actually running on the CPU.
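When cryolo is launched from a script rather than an interactive shell, the same idea as the `LD_LIBRARY_PATH=''` prefix can be expressed by passing a modified environment to the child process. This is a minimal sketch using only the Python standard library; the helper names `env_with_empty_ld_library_path` and `run_isolated` are invented for this example and are not part of cryolo.

```python
import os
import subprocess


def env_with_empty_ld_library_path(base_env=None):
    """Return a copy of the environment with LD_LIBRARY_PATH set to an empty string."""
    env = dict(os.environ if base_env is None else base_env)
    env["LD_LIBRARY_PATH"] = ""
    return env


def run_isolated(command):
    """Run command (a list of arguments) with LD_LIBRARY_PATH emptied.

    Only the child process sees the change; the calling shell is untouched.
    Returns the subprocess.CompletedProcess so the caller can inspect returncode.
    """
    return subprocess.run(command, env=env_with_empty_ld_library_path(), check=False)
```

The command list would start with the cryolo executable followed by the usual arguments. Because only the child's environment is changed, other tools that rely on LD_LIBRARY_PATH keep working in the same session.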

rui--zhang commented 2 years ago

It works now! The 8 GPUs are at full speed. Thank you so much for the help 😊

[Screenshot, 2022-05-02 8:02:29 AM]

thorstenwagner commented 2 years ago

Glad to hear that it works! So LD_LIBRARY_PATH='' did the trick?

rui--zhang commented 2 years ago

Yes, it did the trick!