Hi all,
I just compiled the latest RELION v5.0-beta (commit 12cf15) on our new cluster (OS: openSUSE 15.5, CUDA 12.2). Regular jobs (for instance Class2D) run nicely on 4 GPUs (A100), but model_angelo with the tutorial data fails with more than 1 GPU, while it succeeds with 1 GPU.
This is true both when I submit it from the GUI via SLURM and when I run it directly after ssh-ing onto the GPU node.
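In case it is useful for reproducing this outside of RELION: a minimal test along the lines below should exercise the same code path as ModelAngelo's multi_gpu_wrapper.py (one spawned worker per visible GPU, each calling .to(device)). This is just my own sketch, not ModelAngelo code.

```python
# Minimal sketch (not ModelAngelo code): spawn one process per visible GPU and
# move a small tensor to each device, mimicking what multi_gpu_wrapper.py does.
import torch
import torch.multiprocessing as mp


def worker(rank, n_gpus):
    device = torch.device(f"cuda:{rank}")
    # Essentially the call that fails inside ModelAngelo:
    # model.to(device) / tensor.to(device) in a spawned child process.
    x = torch.ones(8, device=device)
    print(f"worker {rank}/{n_gpus}: OK on {device}, sum = {x.sum().item()}")


if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    print(f"visible GPUs: {n_gpus}")
    # ModelAngelo starts its per-GPU workers with mp.spawn in the same way.
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus, join=True)
```

The full error from ModelAngelo is: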
---------------------------- ModelAngelo -----------------------------
By Kiarash Jamali, Scheres Group, MRC Laboratory of Molecular Biology
--------------------- Initial C-alpha prediction ---------------------
0%| | 0/1000 [00:00<?, ?it/s]2024-09-10 at 14:30:24 | ERROR | Error in ModelAngelo
Traceback (most recent call last):
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/model_angelo/c_alpha/inference.py", line 245, in infer
meta_net_output = wrapper(meta_batch_list)
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/model_angelo/models/multi_gpu_wrapper.py", line 155, in forward
InferenceData(data=send_dict_to_device(data, device), status=1)
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/model_angelo/models/multi_gpu_wrapper.py", line 36, in send_dict_to_device
dictionary[key] = dictionary[key].to(device)
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/model_angelo/apps/build.py", line 208, in main
ca_cif_path = c_alpha_infer(ca_infer_args)
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/model_angelo/c_alpha/inference.py", line 219, in infer
with MultiGPUWrapper(model_definition_path, state_dict_path, device_names) as wrapper:
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/model_angelo/models/multi_gpu_wrapper.py", line 186, in __exit__
self.__del__()
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/model_angelo/models/multi_gpu_wrapper.py", line 180, in __del__
self.proc_ctx.join()
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/model_angelo/models/multi_gpu_wrapper.py", line 85, in run_inference
model = init_model(model_definition_path, state_dict_path, device)
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/model_angelo/models/multi_gpu_wrapper.py", line 69, in init_model
model.to(device)
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
return self._apply(convert)
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
param_applied = fn(param)
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
0%| | 0/1000 [00:03<?, ?it/s]
Exception ignored in: <function MultiGPUWrapper.__del__ at 0x7f2b74abb640>
Traceback (most recent call last):
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/model_angelo/models/multi_gpu_wrapper.py", line 180, in __del__
self.proc_ctx.join()
File "/opt/psi/overlays/Alps/EM/relion/5.0-2beta/miniconda/envs/relion-5.0-2beta/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGTERM
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
As I said, I can successfully run other RELION jobs on 4 A100 GPUs, and I can also run ModelAngelo with 1 GPU. I also checked nvidia-smi.
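For reference, one common cause of "CUDA-capable device(s) is/are busy or unavailable" when one process is spawned per GPU is the GPUs being set to EXCLUSIVE_PROCESS compute mode (nvidia-smi -q -d COMPUTE shows this too). A quick way to check from Python, assuming the nvidia-ml-py (pynvml) package is installed:

```python
# Hypothetical check (assumes nvidia-ml-py / pynvml is installed):
# print each GPU's compute mode; EXCLUSIVE_PROCESS would explain why a second
# CUDA context on the same device fails with "busy or unavailable".
import pynvml

MODE_NAMES = {
    pynvml.NVML_COMPUTEMODE_DEFAULT: "DEFAULT",
    pynvml.NVML_COMPUTEMODE_PROHIBITED: "PROHIBITED",
    pynvml.NVML_COMPUTEMODE_EXCLUSIVE_PROCESS: "EXCLUSIVE_PROCESS",
}

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mode = pynvml.nvmlDeviceGetComputeMode(handle)
        print(f"GPU {i}: compute mode = {MODE_NAMES.get(mode, mode)}")
finally:
    pynvml.nvmlShutdown()
```

If the mode turns out to be EXCLUSIVE_PROCESS, switching it back to DEFAULT (nvidia-smi -c 0, needs root) would be a quick test, but I have not confirmed that this is the cause here.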
Any ideas? Any help is greatly appreciated!
Best, Greta