Open DrJesseHansen opened 2 months ago
I am experiencing a similar issue; however, I'm working with subvolumes extracted in Windows Warp. Depending on the parameters, I also sometimes get errors similar to #1179. I thought this was a problem with my data or with outlier particles, but today I found that the same dataset that fails in RELION5 runs fine in RELION4 with the same 3D auto-refine settings.
My RELION5 error is below.
[della-mol:3626086:0:3626229] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x1459e9877000)
==== backtrace (tid:3626229) ====
0 /lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x1463cfe7607c]
1 /lib64/libucs.so.0(+0x3125c) [0x1463cfe7625c]
2 /lib64/libucs.so.0(+0x3142a) [0x1463cfe7642a]
3 /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_ZN11MlOptimiser42precalculateShiftedImagesCtfsAndInvSigma2sEbbliiiiRSt6vectorI13MultidimArrayI8tComplexIdEESaIS4_EES7_RS0_IS1_IdESaIS8_EER8Matrix1DIdERS0_IS6_SaIS6_EESH_SB_RS0_IdSaIdEERS8_SL_SL_+0xc88) [0x6f98e8]
4 /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi() [0x7756a9]
5 /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi() [0x779be6]
6 /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_ZN15MlOptimiserCuda32doThreadExpectationSomeParticlesEi+0xe2) [0x77b332]
7 /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_Z36globalThreadExpectationSomeParticlesPvi+0x2f) [0x70767f]
8 /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi() [0x7076f5]
9 /usr/lib64/libgomp.so.1(+0x1b4be) [0x1463e4bcb4be]
10 /usr/lib64/libpthread.so.0(+0x81ca) [0x1463e570f1ca]
11 /usr/lib64/libc.so.6(clone+0x43) [0x1463e45fb8d3]
=================================
[della-mol:3626086] *** Process received signal ***
[della-mol:3626086] Signal: Segmentation fault (11)
[della-mol:3626086] Signal code: (-6)
[della-mol:3626086] Failing at address: 0x57ed500375466
[della-mol:3626086] [ 0] /usr/lib64/libpthread.so.0(+0x12d10)[0x1463e5719d10]
[della-mol:3626086] [ 1] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_ZN11MlOptimiser42precalculateShiftedImagesCtfsAndInvSigma2sEbbliiiiRSt6vectorI13MultidimArrayI8tComplexIdEESaIS4_EES7_RS0_IS1_IdESaIS8_EER8Matrix1DIdERS0_IS6_SaIS6_EESH_SB_RS0_IdSaIdEERS8_SL_SL_+0xc88)[0x6f98e8]
[della-mol:3626086] [ 2] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi[0x7756a9]
[della-mol:3626086] [ 3] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi[0x779be6]
[della-mol:3626086] [ 4] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_ZN15MlOptimiserCuda32doThreadExpectationSomeParticlesEi+0xe2)[0x77b332]
[della-mol:3626086] [ 5] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_Z36globalThreadExpectationSomeParticlesPvi+0x2f)[0x70767f]
[della-mol:3626086] [ 6] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi[0x7076f5]
[della-mol:3626086] [ 7] /usr/lib64/libgomp.so.1(+0x1b4be)[0x1463e4bcb4be]
[della-mol:3626086] [ 8] /usr/lib64/libpthread.so.0(+0x81ca)[0x1463e570f1ca]
[della-mol:3626086] [ 9] /usr/lib64/libc.so.6(clone+0x43)[0x1463e45fb8d3]
[della-mol:3626086] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[della-mol:3626087:0:3626218] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x149ca13ed000)
==== backtrace (tid:3626218) ====
0 /lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x14a69fa6a07c]
1 /lib64/libucs.so.0(+0x3125c) [0x14a69fa6a25c]
2 /lib64/libucs.so.0(+0x3142a) [0x14a69fa6a42a]
3 /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_ZN11MlOptimiser42precalculateShiftedImagesCtfsAndInvSigma2sEbbliiiiRSt6vectorI13MultidimArrayI8tComplexIdEESaIS4_EES7_RS0_IS1_IdESaIS8_EER8Matrix1DIdERS0_IS6_SaIS6_EESH_SB_RS0_IdSaIdEERS8_SL_SL_+0xc88) [0x6f98e8]
4 /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi() [0x772790]
5 /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi() [0x779482]
6 /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_ZN15MlOptimiserCuda32doThreadExpectationSomeParticlesEi+0xe2) [0x77b332]
7 /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_Z36globalThreadExpectationSomeParticlesPvi+0x2f) [0x70767f]
8 /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi() [0x7076f5]
9 /usr/lib64/libgomp.so.1(+0x1b4be) [0x14a6b48994be]
10 /usr/lib64/libpthread.so.0(+0x81ca) [0x14a6b53dd1ca]
11 /usr/lib64/libc.so.6(clone+0x43) [0x14a6b42c98d3]
=================================
[della-mol:3626087] *** Process received signal ***
[della-mol:3626087] Signal: Segmentation fault (11)
[della-mol:3626087] Signal code: (-6)
[della-mol:3626087] Failing at address: 0x57ed500375467
[della-mol:3626087] [ 0] /usr/lib64/libpthread.so.0(+0x12d10)[0x14a6b53e7d10]
[della-mol:3626087] [ 1] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_ZN11MlOptimiser42precalculateShiftedImagesCtfsAndInvSigma2sEbbliiiiRSt6vectorI13MultidimArrayI8tComplexIdEESaIS4_EES7_RS0_IS1_IdESaIS8_EER8Matrix1DIdERS0_IS6_SaIS6_EESH_SB_RS0_IdSaIdEERS8_SL_SL_+0xc88)[0x6f98e8]
[della-mol:3626087] [ 2] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi[0x772790]
[della-mol:3626087] [ 3] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi[0x779482]
[della-mol:3626087] [ 4] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_ZN15MlOptimiserCuda32doThreadExpectationSomeParticlesEi+0xe2)[0x77b332]
[della-mol:3626087] [ 5] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_Z36globalThreadExpectationSomeParticlesPvi+0x2f)[0x70767f]
[della-mol:3626087] [ 6] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi[0x7076f5]
[della-mol:3626087] [ 7] /usr/lib64/libgomp.so.1(+0x1b4be)[0x14a6b48994be]
[della-mol:3626087] [ 8] /usr/lib64/libpthread.so.0(+0x81ca)[0x14a6b53dd1ca]
[della-mol:3626087] [ 9] /usr/lib64/libc.so.6(clone+0x43)[0x14a6b42c98d3]
[della-mol:3626087] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 0 on node della-mol exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
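For anyone reading the trace: the frames that carry symbols use Itanium-ABI mangled C++ names. The sketch below is a minimal, hypothetical helper (the names `mangled_symbols` and `qualified_name` are not part of RELION or any library) that pulls those names out of a pasted log and decodes only the leading length-prefixed identifiers. It does not decode argument types; a full demangler such as `c++filt` is needed for that.

```python
import re
from collections import Counter

# Matches "(<mangled name>+0x<offset>)" as printed in the backtrace frames.
FRAME_RE = re.compile(r"\((_Z\w+)\+0x[0-9a-f]+\)")
DIGITS_RE = re.compile(r"\d+")


def mangled_symbols(log_text):
    """Count how often each mangled C++ symbol appears in a backtrace."""
    return Counter(FRAME_RE.findall(log_text))


def qualified_name(mangled):
    """Decode the leading identifiers of an Itanium-ABI mangled name.

    Only the scope and function name are recovered, e.g.
    "_ZN3Foo3barEi" -> "Foo::bar". Parameter types are ignored.
    """
    if not mangled.startswith("_Z"):
        raise ValueError("not an Itanium-ABI mangled name")
    pos = 2
    nested = mangled[pos:pos + 1] == "N"
    if nested:
        pos += 1
    parts = []
    while True:
        match = DIGITS_RE.match(mangled, pos)
        if match is None:
            break
        length = int(match.group())
        start = match.end()
        parts.append(mangled[start:start + length])
        pos = start + length
        if not nested:
            # A non-nested name has a single identifier; what follows
            # is the parameter list, which this sketch does not decode.
            break
    return "::".join(parts)
```

Applied to the log above, the deepest RELION frame decodes to `MlOptimiser::precalculateShiftedImagesCtfsAndInvSigma2s`, which both crashing ranks reach through `MlOptimiserCuda::doThreadExpectationSomeParticles`.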
Hi,
I am running 3D auto-refine on 2D particles from tomograms (tomo pipeline with 2D particle extraction). When I stay within the RELION pipeline, everything works well, with no issues. However, I am also running the same dataset through the new Linux Warp pipeline in parallel. When I extract the 2D particles in Warp and run any job on them in RELION, I get the segmentation fault below. I've tried 3D classification with 1 class and 3D auto-refine. I've also tried reducing memory requirements as much as possible: pad set to 1, a translational search of only 2 pixels, and only 2 MPI processes. See my command below. I have 60k particles and the box size is 40x40. I am running RELION 5 beta 3.
This is running in a cluster compute environment on two Nvidia H100 GPUs (SXM5 80GB), so I think GPU memory should not be an issue. I have allocated 200GB of CPU memory and am monitoring usage during the job: it never goes above roughly 90GB. I am perplexed as to why this is happening. I checked the image stats for the output particles from both pipelines and they have the same map mode (float16), but of course the min/max values are very different because of Warp vs RELION extraction. Could this be the issue? Any idea what might be causing this?
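For reference, here is a minimal sketch of the kind of check described above, using only the Python standard library. It assumes little-endian MRC2014 files with a standard 1024-byte header. The function names are made up for illustration and are not part of RELION or Warp. It reports the mode and header min/max/mean, and counts NaN/Inf values in float16 or float32 data. It is only a diagnostic aid and does not establish the cause of the crash.

```python
import math
import struct

# MRC2014 data mode codes; mode 12 is IEEE 754 half precision.
MRC_MODES = {0: "int8", 1: "int16", 2: "float32", 3: "complex int16",
             4: "complex float32", 6: "uint16", 12: "float16"}
HEADER_SIZE = 1024


def _read_header(fh):
    header = fh.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE:
        raise ValueError("file is shorter than a 1024-byte MRC header")
    return header


def read_mrc_header_stats(path):
    """Return dimensions, mode and the density stats stored in the header."""
    with open(path, "rb") as fh:
        header = _read_header(fh)
    # Assumption: little-endian file (the usual case on x86 machines).
    nx, ny, nz, mode = struct.unpack_from("<4i", header, 0)
    dmin, dmax, dmean = struct.unpack_from("<3f", header, 76)
    return {"shape": (nx, ny, nz), "mode": mode,
            "mode_name": MRC_MODES.get(mode, "unknown"),
            "dmin": dmin, "dmax": dmax, "dmean": dmean}


def count_nonfinite(path):
    """Count NaN/Inf voxels in a float16 or float32 MRC file."""
    with open(path, "rb") as fh:
        header = _read_header(fh)
        nx, ny, nz, mode = struct.unpack_from("<4i", header, 0)
        (nsymbt,) = struct.unpack_from("<i", header, 92)  # extended header bytes
        fmt = {2: "<f", 12: "<e"}.get(mode)
        if fmt is None:
            raise ValueError(f"mode {mode} is not a real floating-point mode")
        size = struct.calcsize(fmt)
        fh.seek(HEADER_SIZE + nsymbt)
        raw = fh.read(nx * ny * nz * size)
    raw = raw[:len(raw) - len(raw) % size]  # tolerate a truncated file
    return sum(1 for (v,) in struct.iter_unpack(fmt, raw)
               if not math.isfinite(v))
```

Running both functions on one particle stack from each pipeline gives a side-by-side comparison of mode, header statistics and non-finite voxel counts.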
My command is below:
The error I am receiving:
Thanks!