Describe your problem

I created a Docker image to benchmark running RELION 4.0.1-commit-ex417f on multiple hosts in our organization.
For the benchmark, I use a standard dataset: ftp://ftp.mrc-lmb.cam.ac.uk/pub/scheres/relion_benchmark.tar.gz
The Dockerfile is here: https://gist.github.com/KrisJanssen/7ff75ad91926e46daa767d71c48f7ced
So far, the resulting container has run fine on every system I have thrown it at, whether on-premises or on some of our Azure VMs.
Today, I tested the same image and job on a new on-premises system, and the run ultimately ended in a segmentation fault. (A sketch of how the container is launched is included at the end of this post.)

Environment:
Dataset:
Job options:
Error message:

Starting the job:
INFO: MPS server daemon started
INFO: 1436 GB of system memory free, pre-reading images
INFO: Running RELION with:
8 GPUs
17 MPI processes total
2 MPI processes per GPU
6 threads per worker process
+ mpirun --allow-run-as-root -n 17 --oversubscribe relion_refine_mpi --gpu --i Particles/shiny_2sets.star --ref emd_2660.map:mrc --firstiter_cc --ini_high 60 --ctf --ctf_corrected_ref --tau2_fudge 4 --K 6 --flatten_solvent --healpix_order 2 --sym C1 --iter 25 --particle_diameter 360 --zero_mask --oversampling 1 --offset_range 5 --offset_step 2 --norm --scale --random_seed 0 --pool 100 --dont_combine_weights_via_disc --o /host_pwd/run.2024.06.25.22.19 --j 6 --preread_images
+ tee /host_pwd/run.2024.06.25.22.19/log.txt
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: XXX
Local device: qedr0
Local port: 1
CPCs attempted: udcm
--------------------------------------------------------------------------
[qelr_create_qp:746]create qp: failed on ibv_cmd_create_qp with 22
[qelr_create_qp:746]create qp: failed on ibv_cmd_create_qp with 22
[qelr_create_qp:746]create qp: failed on ibv_cmd_create_qp with 22
[qelr_create_qp:746]create qp: failed on ibv_cmd_create_qp with 22
<tons more of the same 'failed' messages>
=== RELION MPI setup ===
+ Number of MPI processes = 17
+ Number of threads per MPI process = 6
+ Total number of threads therefore = 102
+ Leader (0) runs on host = XXX
+ Follower 1 runs on host = XXX
+ Follower 2 runs on host = XXX
+ Follower 3 runs on host = XXX
+ Follower 4 runs on host = XXX
+ Follower 5 runs on host = XXX
+ Follower 6 runs on host = XXX
+ Follower 7 runs on host = XXX
+ Follower 8 runs on host = XXX
+ Follower 9 runs on host = XXX
+ Follower 10 runs on host = XXX
+ Follower 11 runs on host = XXX
+ Follower 12 runs on host = XXX
+ Follower 13 runs on host = XXX
+ Follower 14 runs on host = XXX
+ Follower 15 runs on host = XXX
+ Follower 16 runs on host = XXX
=================
[XXX.dir.ucb-group.com:00235] 33 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[XXX.dir.ucb-group.com:00235] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
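(Aside: Open MPI lets you exclude the openib BTL outright through its MCA parameters, which silences these warnings when the ranks can fall back to shared memory or TCP. A hedged sketch of such an invocation, with the RELION options elided; I have not verified whether it changes the outcome on this machine:

mpirun --allow-run-as-root -n 17 --oversubscribe \
    --mca btl '^openib' \
    relion_refine_mpi <same RELION options as above>

Whether the openib noise is related to the crash further down is unclear.)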
uniqueHost XXX has 16 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on follower 1 mapped to device 0
Thread 1 on follower 1 mapped to device 0
Thread 2 on follower 1 mapped to device 0
Thread 3 on follower 1 mapped to device 0
Thread 4 on follower 1 mapped to device 0
Thread 5 on follower 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on follower 2 mapped to device 0
Thread 1 on follower 2 mapped to device 0
Thread 2 on follower 2 mapped to device 0
Thread 3 on follower 2 mapped to device 0
Thread 4 on follower 2 mapped to device 0
Thread 5 on follower 2 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
Thread 0 on follower 3 mapped to device 1
Thread 1 on follower 3 mapped to device 1
<a bunch more MPI messages>
Then finally, it all goes pear-shaped:
<segmentation fault output not shown>
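For completeness, the container is launched roughly as follows. This is an illustrative sketch rather than the exact wrapper invocation: the image tag relion-benchmark is a placeholder, and it assumes the NVIDIA container toolkit is installed so that Docker's --gpus flag works. The bind mount matches the /host_pwd output path visible in the log.

# Hypothetical launch command; the image tag is a placeholder.
# The working directory holding the benchmark data is mounted at
# /host_pwd, where the job writes its run.* output and log.txt.
docker run --rm \
    --gpus all \
    -v "$(pwd)":/host_pwd \
    relion-benchmark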