3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

Segmentation fault: address not mapped to object at address 0x80 #1181

Open DrJesseHansen opened 2 months ago

DrJesseHansen commented 2 months ago

Hi,

I am running 3D auto-refine on 2D particles from tomograms (tomo pipeline with 2D particle extraction). When I stay within the RELION pipeline, everything works well with no issues. However, I am also running the same dataset through the new Linux Warp pipeline in parallel. When I extract the 2D particles in Warp and run any job on them in RELION, I get the segmentation fault below. I've tried 3D classification with 1 class and 3D auto-refine. I've tried reducing memory requirements as much as possible: pad set to 1, a translational search of only 2 pixels, and MPI reduced to only 2 processes. See my command below. I have 60k particles and the box size is 40x40. I am running RELION 5 (beta 3).

This is running in a cluster compute environment on two NVIDIA H100 GPUs (SXM5, 80GB), so I think GPU memory should not be an issue. I have allocated 200GB of CPU memory and am monitoring CPU memory during the job: it never goes over 90GB or so. I am perplexed as to why this is happening. I checked the image stats for the output particles from both extractions: they have the same map mode (float16), but of course the min/max values are very different because of Warp vs. RELION extraction. Could this be the issue? Any idea what might be causing this?
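
For anyone who wants to repeat the header comparison, here is a minimal sketch that reads the map mode and the stored min/max/mean straight from the 1024-byte MRC header. It assumes MRC2014-format files; the names `parse_mrc_header` and `read_mrc_header` are made up for illustration and are not part of RELION or Warp. The statistics are whatever the writing program stored in the header, so they may not match the actual pixel values.

```python
import struct

# MRC2014 mode codes; mode 12 is IEEE half precision (float16).
MRC_MODES = {0: "int8", 1: "int16", 2: "float32", 6: "uint16", 12: "float16"}


def parse_mrc_header(header: bytes) -> dict:
    """Decode size, mode and density statistics from a 1024-byte MRC header."""
    if len(header) < 1024:
        raise ValueError("an MRC header is 1024 bytes long")
    # Machine stamp (byte offset 212): 0x11 means big-endian,
    # 0x44 (or anything else) is treated as little-endian here.
    endian = ">" if header[212] == 0x11 else "<"
    # Words 1-4: NX, NY, NZ, MODE (32-bit integers).
    nx, ny, nz, mode = struct.unpack_from(endian + "4i", header, 0)
    # Words 20-22: DMIN, DMAX, DMEAN (32-bit floats), byte offset 76.
    dmin, dmax, dmean = struct.unpack_from(endian + "3f", header, 76)
    return {
        "shape": (nx, ny, nz),
        "mode": mode,
        "mode_name": MRC_MODES.get(mode, "other"),
        "dmin": dmin,
        "dmax": dmax,
        "dmean": dmean,
    }


def read_mrc_header(path: str) -> dict:
    """Read the first 1024 bytes of an MRC file and decode them."""
    with open(path, "rb") as handle:
        return parse_mrc_header(handle.read(1024))
```

Calling `read_mrc_header` on one particle stack from each extraction and printing the two dictionaries side by side shows whether the mode, box size, or stored value range differ.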

My command is below:

#!/bin/bash
#SBATCH --ntasks=3
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --time=239:00:00
#SBATCH --mem=200G
#SBATCH --partition=gpu100
#SBATCH --gres=gpu:2
#SBATCH --export=NONE

cd $SLURM_SUBMIT_DIR

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
module purge
module load relion/5-beta6
unset SLURM_EXPORT_ENV

# Create necessary directories
mkdir -p Refine3D/job001_local_3_redo

# Run Relion refine process with MPI
mpirun -n 3 `which relion_refine_mpi` \
--o Refine3D/job001_local_3_redo/run \
--auto_refine \
--split_random_halves \
--firstiter_cc \
--ios reextracted_bin8_3D_optimisation_set.star \
--ref InitialModel/recon.mrc \
--trust_ref_size \
--ini_high 40 \
--dont_combine_weights_via_disc \
--pool 10 \
--pad 1  \
--ctf \
--particle_diameter 400 \
--flatten_solvent \
--zero_mask \
--oversampling 1 \
--healpix_order 3 \
--auto_local_healpix_order 3 \
--offset_range 2 \
--offset_step 2 \
--sym C1 \
--low_resol_join_halves 40 \
--norm \
--scale  \
--j 1 \
--gpu ""   

The error I am receiving:

Auto-refine: Iteration= 1
 Auto-refine: Resolution= 40.2036 (no gain for 0 iter) 
 Auto-refine: Changes in angles= 999 degrees; and in offsets= 999 Angstroms (no gain for 0 iter) 
 Estimating accuracies in the orientational assignment ... 
   3/   3 sec ............................................................~~(,_,">
 Auto-refine: Estimated accuracy angles= 1.484 degrees; offsets= 3.89171 Angstroms
 CurrentResolution= 40.2036 Angstroms, which requires orientationSampling of at least 11.25 degrees for a particle of diameter 400 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 945
 OrientationalSampling= 7.5 NrOrientations= 135
 TranslationalSampling= 22.112 NrTranslations= 7
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 60480
 OrientationalSampling= 3.75 NrOrientations= 1080
 TranslationalSampling= 11.056 NrTranslations= 56
=============================
 Expectation iteration 1
7.45/40.35 min ...........~~(,_,">[gpu271:3904135:0:3904135] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x80)
==== backtrace (tid:3904135) ====
 0 0x000000000003c050 __sigaction()  ???:0
 1 0x00000000003d58ff getAllSquaredDifferencesCoarse<MlOptimiserCuda>()  tmpxft_003a2465_00000000-6_cuda_ml_optimiser.cudafe1.cpp:0
 2 0x00000000003d9fc4 accDoExpectationOneParticle<MlOptimiserCuda>()  tmpxft_003a2465_00000000-6_cuda_ml_optimiser.cudafe1.cpp:0
 3 0x00000000003db852 MlOptimiserCuda::doThreadExpectationSomeParticles()  ???:0
 4 0x000000000036b96f globalThreadExpectationSomeParticles()  ???:0
 5 0x000000000036b9e5 MlOptimiser::expectationSomeParticles()  ml_optimiser.cpp:0
 6 0x00000000000140b6 GOMP_parallel()  ???:0
 7 0x0000000000358a6e MlOptimiser::expectationSomeParticles()  ???:0
 8 0x0000000000130bad MlOptimiserMpi::expectation()  ???:0
 9 0x000000000014610c MlOptimiserMpi::iterate()  ???:0
10 0x00000000000f39c2 main()  ???:0
11 0x000000000002724a __libc_init_first()  ???:0
12 0x0000000000027305 __libc_start_main()  ???:0
13 0x00000000000f7251 _start()  ???:0
=================================
[gpu271:3904135] *** Process received signal ***
[gpu271:3904135] Signal: Segmentation fault (11)
[gpu271:3904135] Signal code:  (-6)
[gpu271:3904135] Failing at address: 0xf57ae003b9287
[gpu271:3904135] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3c050)[0x14e786e13050]
[gpu271:3904135] [ 1] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(+0x3d58ff)[0x5581b98b58ff]
[gpu271:3904135] [ 2] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(+0x3d9fc4)[0x5581b98b9fc4]
[gpu271:3904135] [ 3] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(_ZN15MlOptimiserCuda32doThreadExpectationSomeParticlesEi+0xe2)[0x5581b98bb852]
[gpu271:3904135] [ 4] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(_Z36globalThreadExpectationSomeParticlesPvi+0x2f)[0x5581b984b96f]
[gpu271:3904135] [ 5] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(+0x36b9e5)[0x5581b984b9e5]
[gpu271:3904135] [ 6] /lib/x86_64-linux-gnu/libgomp.so.1(GOMP_parallel+0x46)[0x14e786fcc0b6]
[gpu271:3904135] [ 7] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(_ZN11MlOptimiser24expectationSomeParticlesEll+0xd5e)[0x5581b9838a6e]
[gpu271:3904135] [ 8] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x1f2d)[0x5581b9610bad]
[gpu271:3904135] [ 9] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xbc)[0x5581b962610c]
[gpu271:3904135] [10] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(main+0x52)[0x5581b95d39c2]
[gpu271:3904135] [11] /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x14e786dfe24a]
[gpu271:3904135] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x14e786dfe305]
[gpu271:3904135] [13] /mnt/nfs/clustersw/Debian/bookworm/cuda/12.0/relion/5-beta6/bin/relion_refine_mpi(_start+0x21)[0x5581b95d7251]
[gpu271:3904135] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 3904135 on node gpu271 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
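
Several frames in the backtrace above are reported only as an offset into the binary, for example `relion_refine_mpi(+0x3d58ff)`. The sketch below turns those frames into `addr2line` commands, which is one way to try mapping them to source locations. It is only an illustration: the helper name `addr2line_commands` is made up, it handles only the `(+0xOFFSET)` frame form printed by Open MPI, and `addr2line` can report file and line numbers only if the binary was built with debug information.

```python
import re
import shlex

# Matches Open MPI frames such as:
#   [host:pid] [ 1] /path/to/binary(+0x3d58ff)[0x5581b98b58ff]
FRAME_RE = re.compile(
    r"\[\s*(\d+)\]\s+(\S+?)\(\+(0x[0-9a-fA-F]+)\)\[0x[0-9a-fA-F]+\]"
)


def addr2line_commands(log_text: str, binary_name: str = "relion_refine_mpi") -> list:
    """Build one addr2line command per offset-only frame of the given binary."""
    commands = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue
        _frame_no, path, offset = match.groups()
        # Skip frames from other objects such as libc or libgomp.
        if not path.endswith(binary_name):
            continue
        # -f prints the function name, -C demangles C++ symbols.
        commands.append(f"addr2line -f -C -e {shlex.quote(path)} {offset}")
    return commands
```

Passing the contents of the job's error log to `addr2line_commands` returns the commands as strings; they would need to be run on a machine that has the same `relion_refine_mpi` build installed.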

Thanks!

ryanfeathers commented 2 weeks ago

I am experiencing a similar issue; however, I'm working with subvolumes extracted in Windows Warp. Depending on the parameters, I also sometimes receive errors similar to #1179. I thought this was a problem with my data or outlier particles, but today I found that the same dataset that fails in RELION5 runs fine in RELION4 with the same 3D auto-refine settings.

My RELION5 error is below.

[della-mol:3626086:0:3626229] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x1459e9877000)
==== backtrace (tid:3626229) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x1463cfe7607c]
 1  /lib64/libucs.so.0(+0x3125c) [0x1463cfe7625c]
 2  /lib64/libucs.so.0(+0x3142a) [0x1463cfe7642a]
 3  /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_ZN11MlOptimiser42precalculateShiftedImagesCtfsAndInvSigma2sEbbliiiiRSt6vectorI13MultidimArrayI8tComplexIdEESaIS4_EES7_RS0_IS1_IdESaIS8_EER8Matrix1DIdERS0_IS6_SaIS6_EESH_SB_RS0_IdSaIdEERS8_SL_SL_+0xc88) [0x6f98e8]
 4  /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi() [0x7756a9]
 5  /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi() [0x779be6]
 6  /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_ZN15MlOptimiserCuda32doThreadExpectationSomeParticlesEi+0xe2) [0x77b332]
 7  /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_Z36globalThreadExpectationSomeParticlesPvi+0x2f) [0x70767f]
 8  /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi() [0x7076f5]
 9  /usr/lib64/libgomp.so.1(+0x1b4be) [0x1463e4bcb4be]
10  /usr/lib64/libpthread.so.0(+0x81ca) [0x1463e570f1ca]
11  /usr/lib64/libc.so.6(clone+0x43) [0x1463e45fb8d3]
=================================
[della-mol:3626086] *** Process received signal ***
[della-mol:3626086] Signal: Segmentation fault (11)
[della-mol:3626086] Signal code:  (-6)
[della-mol:3626086] Failing at address: 0x57ed500375466
[della-mol:3626086] [ 0] /usr/lib64/libpthread.so.0(+0x12d10)[0x1463e5719d10]
[della-mol:3626086] [ 1] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_ZN11MlOptimiser42precalculateShiftedImagesCtfsAndInvSigma2sEbbliiiiRSt6vectorI13MultidimArrayI8tComplexIdEESaIS4_EES7_RS0_IS1_IdESaIS8_EER8Matrix1DIdERS0_IS6_SaIS6_EESH_SB_RS0_IdSaIdEERS8_SL_SL_+0xc88)[0x6f98e8]
[della-mol:3626086] [ 2] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi[0x7756a9]
[della-mol:3626086] [ 3] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi[0x779be6]
[della-mol:3626086] [ 4] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_ZN15MlOptimiserCuda32doThreadExpectationSomeParticlesEi+0xe2)[0x77b332]
[della-mol:3626086] [ 5] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_Z36globalThreadExpectationSomeParticlesPvi+0x2f)[0x70767f]
[della-mol:3626086] [ 6] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi[0x7076f5]
[della-mol:3626086] [ 7] /usr/lib64/libgomp.so.1(+0x1b4be)[0x1463e4bcb4be]
[della-mol:3626086] [ 8] /usr/lib64/libpthread.so.0(+0x81ca)[0x1463e570f1ca]
[della-mol:3626086] [ 9] /usr/lib64/libc.so.6(clone+0x43)[0x1463e45fb8d3]
[della-mol:3626086] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[della-mol:3626087:0:3626218] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x149ca13ed000)
==== backtrace (tid:3626218) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x14a69fa6a07c]
 1  /lib64/libucs.so.0(+0x3125c) [0x14a69fa6a25c]
 2  /lib64/libucs.so.0(+0x3142a) [0x14a69fa6a42a]
 3  /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_ZN11MlOptimiser42precalculateShiftedImagesCtfsAndInvSigma2sEbbliiiiRSt6vectorI13MultidimArrayI8tComplexIdEESaIS4_EES7_RS0_IS1_IdESaIS8_EER8Matrix1DIdERS0_IS6_SaIS6_EESH_SB_RS0_IdSaIdEERS8_SL_SL_+0xc88) [0x6f98e8]
 4  /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi() [0x772790]
 5  /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi() [0x779482]
 6  /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_ZN15MlOptimiserCuda32doThreadExpectationSomeParticlesEi+0xe2) [0x77b332]
 7  /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_Z36globalThreadExpectationSomeParticlesPvi+0x2f) [0x70767f]
 8  /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi() [0x7076f5]
 9  /usr/lib64/libgomp.so.1(+0x1b4be) [0x14a6b48994be]
10  /usr/lib64/libpthread.so.0(+0x81ca) [0x14a6b53dd1ca]
11  /usr/lib64/libc.so.6(clone+0x43) [0x14a6b42c98d3]
=================================
[della-mol:3626087] *** Process received signal ***
[della-mol:3626087] Signal: Segmentation fault (11)
[della-mol:3626087] Signal code:  (-6)
[della-mol:3626087] Failing at address: 0x57ed500375467
[della-mol:3626087] [ 0] /usr/lib64/libpthread.so.0(+0x12d10)[0x14a6b53e7d10]
[della-mol:3626087] [ 1] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_ZN11MlOptimiser42precalculateShiftedImagesCtfsAndInvSigma2sEbbliiiiRSt6vectorI13MultidimArrayI8tComplexIdEESaIS4_EES7_RS0_IS1_IdESaIS8_EER8Matrix1DIdERS0_IS6_SaIS6_EESH_SB_RS0_IdSaIdEERS8_SL_SL_+0xc88)[0x6f98e8]
[della-mol:3626087] [ 2] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi[0x772790]
[della-mol:3626087] [ 3] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi[0x779482]
[della-mol:3626087] [ 4] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_ZN15MlOptimiserCuda32doThreadExpectationSomeParticlesEi+0xe2)[0x77b332]
[della-mol:3626087] [ 5] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi(_Z36globalThreadExpectationSomeParticlesPvi+0x2f)[0x70767f]
[della-mol:3626087] [ 6] /projects/MOLBIO/local/relion-5.0-beta-4-gcc-13.2.1-cuda-12.4-rhel8-arch80/bin/relion_refine_mpi[0x7076f5]
[della-mol:3626087] [ 7] /usr/lib64/libgomp.so.1(+0x1b4be)[0x14a6b48994be]
[della-mol:3626087] [ 8] /usr/lib64/libpthread.so.0(+0x81ca)[0x14a6b53dd1ca]
[della-mol:3626087] [ 9] /usr/lib64/libc.so.6(clone+0x43)[0x14a6b42c98d3]
[della-mol:3626087] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 0 on node della-mol exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------