3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

Segmentation Fault relion_refine_mpi #1154

Open KrisJanssen opened 1 week ago

KrisJanssen commented 1 week ago

Describe your problem

I created a Docker image to benchmark RELION 4.0.1-commit-ex417f on multiple hosts in our organization.

For the benchmark, I use a standard dataset: ftp://ftp.mrc-lmb.cam.ac.uk/pub/scheres/relion_benchmark.tar.gz

The dockerfile is here: https://gist.github.com/KrisJanssen/7ff75ad91926e46daa767d71c48f7ced

So far, the resulting container has run fine on every system I threw it at, whether on-premises or on some of our Azure VMs.

Today, I wanted to test the same image and job on a new on-premises system, which ultimately resulted in a segmentation fault.
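For reference, the container is launched roughly along these lines (the image tag, host paths, and flags here are illustrative placeholders, not the exact invocation from the gist):

```shell
# Hypothetical launch command for the benchmark container.
# Image tag and host data path are placeholders, not the real values.
img="relion-benchmark:4.0.1"
data="$PWD/relion_benchmark"
# --gpus all exposes the NVIDIA devices; /host_pwd matches the --o path
# used by relion_refine_mpi inside the container.
cmd="docker run --rm --gpus all --ipc=host -v $data:/host_pwd $img"
echo "$cmd"
```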

Environment:

Dataset:

Job options:

Error message:


Starting the job:


INFO: MPS server daemon started
INFO: 1436  GB of system memory free, pre-reading images
INFO: Running RELION with:
  8 GPUs
  17 MPI processes total
  2 MPI processes per GPU
  6 threads per worker process
+ mpirun --allow-run-as-root -n 17 --oversubscribe relion_refine_mpi --gpu --i Particles/shiny_2sets.star --ref emd_2660.map:mrc --firstiter_cc --ini_high 60 --ctf --ctf_corrected_ref --tau2_fudge 4 --K 6 --flatten_solvent --healpix_order 2 --sym C1 --iter 25 --particle_diameter 360 --zero_mask --oversampling 1 --offset_range 5 --offset_step 2 --norm --scale --random_seed 0 --pool 100 --dont_combine_weights_via_disc --o /host_pwd/run.2024.06.25.22.19 --j 6 --preread_images
+ tee /host_pwd/run.2024.06.25.22.19/log.txt
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           XXX
  Local device:         qedr0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
[qelr_create_qp:746]create qp: failed on ibv_cmd_create_qp with 22
[qelr_create_qp:746]create qp: failed on ibv_cmd_create_qp with 22
[qelr_create_qp:746]create qp: failed on ibv_cmd_create_qp with 22
[qelr_create_qp:746]create qp: failed on ibv_cmd_create_qp with 22

<tons more of the same 'failed' messages>

 === RELION MPI setup ===
 + Number of MPI processes             = 17
 + Number of threads per MPI process   = 6
 + Total number of threads therefore   = 102
 + Leader  (0) runs on host            = XXX
 + Follower     1 runs on host            = XXX
 + Follower     2 runs on host            = XXX
 + Follower     3 runs on host            = XXX
 + Follower     4 runs on host            = XXX
 + Follower     5 runs on host            = XXX
 + Follower     6 runs on host            = XXX
 + Follower     7 runs on host            = XXX
 + Follower     8 runs on host            = XXX
 + Follower     9 runs on host            = XXX
 + Follower    10 runs on host            = XXX
 + Follower    11 runs on host            = XXX
 + Follower    12 runs on host            = XXX
 + Follower    13 runs on host            = XXX
 + Follower    14 runs on host            = XXX
 + Follower    15 runs on host            = XXX
 + Follower    16 runs on host            = XXX
 =================
[XXX.dir.ucb-group.com:00235] 33 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[XXX.dir.ucb-group.com:00235] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
 uniqueHost XXX has 16 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 1 mapped to device 0
 Thread 1 on follower 1 mapped to device 0
 Thread 2 on follower 1 mapped to device 0
 Thread 3 on follower 1 mapped to device 0
 Thread 4 on follower 1 mapped to device 0
 Thread 5 on follower 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 2 mapped to device 0
 Thread 1 on follower 2 mapped to device 0
 Thread 2 on follower 2 mapped to device 0
 Thread 3 on follower 2 mapped to device 0
 Thread 4 on follower 2 mapped to device 0
 Thread 5 on follower 2 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 3 mapped to device 1
 Thread 1 on follower 3 mapped to device 1

<a bunch more MPI messages>

Then finally, it all goes pear-shaped:


The following warnings were encountered upon command-line parsing:
WARNING: Option --ctf_corrected_ref     is not a valid RELION argument
 Running CPU instructions in double precision.
WARNING: Particles/shiny_2sets.star seems to be from a previous version of Relion. Attempting conversion...
         You should make sure metadata in the optics group table after conversion is correct.
 Estimating initial noise spectra from 1000 particles
   2/   2 sec ............................................................~~(,_,">
[XXX:00419] *** Process received signal ***
[XXX:00419] Signal: Segmentation fault (11)
[XXX:00419] Signal code: Address not mapped (1)
[XXX:00419] Failing at address: (nil)
[XXX:00419] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f46dda7e420]
[XXX:00419] [ 1] /usr/lib/x86_64-linux-gnu/libibverbs.so.1(ibv_dereg_mr+0xe)[0x7f46dc23003e]
[XXX:00419] [ 2] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x536c7)[0x7f46d6dbe6c7]
[XXX:00419] [ 3] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x758d9)[0x7f46d6de08d9]
[XXX:00419] [ 4] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x759c1)[0x7f46d6de09c1]
[XXX:00419] [ 5] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x792b5)[0x7f46d6de42b5]
[XXX:00419] [ 6] /usr/lib/x86_64-linux-gnu/libfabric.so.1(+0x752d3)[0x7f46d6de02d3]
[XXX:00419] [ 7] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_mtl_ofi.so(+0x4c8e)[0x7f46dc248c8e]
[XXX:00419] [ 8] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_pml_cm.so(+0x2958)[0x7f46dc26c958]
[XXX:00419] [ 9] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x3c0)[0x7f46dddb9c50]
[XXX:00419] [10] /usr/lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_bcast_intra_pipeline+0xd1)[0x7f46dddba061]
[XXX:00419] [11] /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x12e)[0x7f46d6c23dae]
[XXX:00419] [12] /usr/lib/x86_64-linux-gnu/libmpi.so.40(MPI_Bcast+0x120)[0x7f46ddd7cb10]
[XXX:00419] [13] relion_refine_mpi(_ZN7MpiNode16relion_MPI_BcastEPvlP15ompi_datatype_tiP19ompi_communicator_t+0x176)[0x55f6416e5566]
[XXX:00419] [14] relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x178)[0x55f6416cbfa8]
[XXX:00419] [15] relion_refine_mpi(main+0x71)[0x55f641683d11]
[XXX:00419] [16] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f46dd4fc083]
[XXX:00419] [17] relion_refine_mpi(_start+0x2e)[0x55f64168758e]
[XXX:00419] *** End of error message ***
[qelr_poll_cq_req:2103]Error: POLL CQ with ROCE_CQE_REQ_STS_WORK_REQUEST_FLUSHED_ERR. QP icid=0x11b1
relion_refine_mpi: prov/verbs/src/verbs_cq.c:404: fi_ibv_poll_cq: Assertion `wre && (wre->ep || wre->srq)' failed.
[XXX:00420] *** Process received signal ***
[XXX:00420] Signal: Aborted (6)
[XXX:00420] Signal code:  (-6)
[qelr_poll_cq_req:2103]Error: POLL CQ with ROCE_CQE_REQ_STS_WORK_REQUEST_FLUSHED_ERR. QP icid=0x11e1
relion_refine_mpi: prov/verbs/src/verbs_cq.c:404: fi_ibv_poll_cq: Assertion `wre && (wre->ep || wre->srq)' failed.
[XXX:00417] *** Process received signal ***
[XXX:00417] Signal: Aborted (6)
[XXX:00417] Signal code:  (-6)
[qelr_poll_cq_req:2103]Error: POLL CQ with ROCE_CQE_REQ_STS_WORK_REQUEST_FLUSHED_ERR. QP icid=0x121d
relion_refine_mpi: prov/verbs/src/verbs_cq.c:404: fi_ibv_poll_cq: Assertion `wre && (wre->ep || wre->srq)' failed.
[XXX:00418] *** Process received signal ***
[XXX:00418] Signal: Aborted (6)
[XXX:00418] Signal code:  (-6)
[qelr_poll_cq_req:2103]Error: POLL CQ with ROCE_CQE_REQ_STS_WORK_REQUEST_FLUSHED_ERR. QP icid=0x126d
relion_refine_mpi: prov/verbs/src/verbs_cq.c:404: fi_ibv_poll_cq: Assertion `wre && (wre->ep || wre->srq)' failed.
[qelr_poll_cq_req:2103]Error: POLL CQ with ROCE_CQE_REQ_STS_WORK_REQUEST_FLUSHED_ERR. QP icid=0x129b
relion_refine_mpi: prov/verbs/src/verbs_cq.c:404: fi_ibv_poll_cq: Assertion `wre && (wre->ep || wre->srq)' failed.
[XXX:00447] *** Process received signal ***
[XXX:00447] Signal: Aborted (6)
[XXX:00447] Signal code:  (-6)
[XXX:00435] *** Process received signal ***
[XXX:00435] Signal: Aborted (6)
[XXX:00435] Signal code:  (-6)
<interleaved, mutually garbled backtraces from PIDs 00417, 00418, 00420, 00435 and 00447 follow; all abort in libfabric's fi_ibv_poll_cq, reached from main -> MlOptimiserMpi::initialise -> MpiNode::relion_MPI_Bcast -> MPI_Bcast -> opal_progress -> mca_mtl_ofi -> libfabric>
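Not a fix, but one isolation step I could try: since every backtrace dies inside the libfabric verbs path on the qedr (QLogic RoCE) NIC, forcing Open MPI off the OFI/verbs transports and onto plain TCP plus shared memory should tell us whether the NIC stack is the culprit. A sketch, assuming the container's Open MPI 3/4 component names:

```shell
# Sketch: steer Open MPI away from the OFI MTL and openib BTL so it
# falls back to the ob1 PML over TCP and shared memory (vader).
# Component names assume Open MPI 3.x/4.x; slower, but avoids libfabric.
export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=self,vader,tcp
export OMPI_MCA_mtl=^ofi
echo "pml=$OMPI_MCA_pml btl=$OMPI_MCA_btl mtl=$OMPI_MCA_mtl"
```

The same selections can be passed on the command line instead, e.g. `mpirun --mca pml ob1 --mca btl self,vader,tcp ...`, if editing the container environment is inconvenient.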
KrisJanssen commented 3 days ago

No ideas from the community?