3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

semop lock error during 3D classification #1177

Open DrJesseHansen opened 3 months ago

DrJesseHansen commented 3 months ago

Running this command interactively on a GPU node with two 2080 Ti cards. The same error occurs when submitting to the Slurm cluster on our HPC.

Running RELION 5.0-beta-3, commit 6331fe.

command:

mpirun --np 5 --oversubscribe relion_refine_mpi --o Class3D/job055/run --ios Extract/job025/optimisation_set.star --gpu "" --ref InitialModel/box40_bin8_invert.mrc --firstiter_cc --trust_ref_size --ini_high 60 --dont_combine_weights_via_disc --pool 3 --pad 2 --ctf --iter 25 --tau2_fudge 1 --particle_diameter 440 --K 1 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale --j 1 --pipeline_control Class3D/job055/

error:


Expectation iteration 1 of 25
000/??? sec ~~(,_,">                                                          [oo]^Cjhansen@gpu148:/mnt/beegfs/schurgrp/jhansen/HTT/RELION5$ ^C
jhansen@gpu148:/mnt/beegfs/schurgrp/jhansen/HTT/RELION5$ ./07_classify1class.job 
RELION version: 5.0-beta-3-commit-6331fe 
Precision: BASE=double, CUDA-ACC=single 

 === RELION MPI setup ===
 + Number of MPI processes                 = 5
 + Leader      (0) runs on host            = gpu148
 + Follower     1  runs on host            = gpu148
 + Follower     2  runs on host            = gpu148
 + Follower     3  runs on host            = gpu148
 + Follower     4  runs on host            = gpu148
 ==========================
 uniqueHost gpu148 has 4 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 1 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 2 mapped to device 0
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 3 mapped to device 1
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on follower 4 mapped to device 1
Device 0 on gpu148 is split between 2 followers
Device 1 on gpu148 is split between 2 followers
 Running CPU instructions in double precision. 
 WARNING:  The reference pixel size is 1 A/px, but the pixel size of the first optics group of the data is 11.056 A/px! 
WARNING: Although the requested resized pixel size is 11.056 A/px, the actual resized pixel size of the reference will be 10 A/px due to rounding of the box size to an even number. 
WARNING: Resizing input reference(s) to pixel_size= 10 and box size= 40 ...
 Estimating initial noise spectra from at most 10 particles 
   0/   0 sec ............................................................~~(,_,">
 CurrentResolution= 57.1429 Angstroms, which requires orientationSampling of at least 14.4 degrees for a particle of diameter 440 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 373248
 OrientationalSampling= 15 NrOrientations= 4608
 TranslationalSampling= 20 NrTranslations= 81
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 23887872
 OrientationalSampling= 7.5 NrOrientations= 36864
 TranslationalSampling= 10 NrTranslations= 648
=============================
 Expectation iteration 1 of 25
4.30/4.30 hrs ............................................................~~(,_,">
 Maximization...
   0/   0 sec ............................................................~~(,_,">
in: /dev/shm/schloegl-src-relion-5-beta6-KaMZkjUz/relion/src/projector.cpp, line 208
ERROR: 
semop lock error
in: /dev/shm/schloegl-src-relion-5-beta6-KaMZkjUz/relion/src/projector.cpp, line 208
ERROR: 
semop lock error
in: /dev/shm/schloegl-src-relion-5-beta6-KaMZkjUz/relion/src/projector.cpp, line 208
ERROR: 
semop lock error
=== Backtrace  ===
=== Backtrace  ===
=== Backtrace  ===
relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x6a) [0x55bb0ec3942a]
relion_refine_mpi(+0x5e60c) [0x55bb0eb8f60c]
relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x81b) [0x55bb0ee2cabb]
relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x55bb0ee48a2c]
relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e9) [0x55bb0ec60069]
relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xbc) [0x55bb0ec7710c]
relion_refine_mpi(main+0x52) [0x55bb0ec249c2]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7f14bdc4624a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f14bdc46305]
relion_refine_mpi(_start+0x21) [0x55bb0ec28251]
==================
ERROR: 
semop lock error

 RELION version: 5.0-beta-3-commit-6331fe
 exiting with an error ...
relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x6a) [0x56136c52842a]
relion_refine_mpi(+0x5e60c) [0x56136c47e60c]
relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x81b) [0x56136c71babb]
relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x56136c737a2c]
relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e9) [0x56136c54f069]
relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xbc) [0x56136c56610c]
relion_refine_mpi(main+0x52) [0x56136c5139c2]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7f266aa4624a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7f266aa46305]
relion_refine_mpi(_start+0x21) [0x56136c517251]
==================
ERROR: 
semop lock error

 RELION version: 5.0-beta-3-commit-6331fe
 exiting with an error ...
relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x6a) [0x56542492742a]
relion_refine_mpi(+0x5e60c) [0x56542487d60c]
relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x81b) [0x565424b1aabb]
relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x565424b36a2c]
relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e9) [0x56542494e069]
relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xbc) [0x56542496510c]
relion_refine_mpi(main+0x52) [0x5654249129c2]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x7faedd64624a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7faedd646305]
relion_refine_mpi(_start+0x21) [0x565424916251]
==================
ERROR: 
semop lock error

 RELION version: 5.0-beta-3-commit-6331fe
 exiting with an error ...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[gpu148:295268] 2 more processes have sent help message help-mpi-api.txt / mpi-abort
[gpu148:295268] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

==================
biochem-fan commented 3 months ago

Did you read and try suggestions in #738?

huwjenkins commented 3 months ago

If this is the same issue as described in #738 this is caused by the OS destroying the semaphores. Are you logging out of the machine whilst RELION is running? And on the cluster did you log into and then log out of the node the job was running on?

ipcs -s is useful to diagnose what's going on.
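
For example, you can list the semaphore arrays while the job is running and again after reconnecting (this assumes util-linux's `ipcs` is available, which it is on essentially any Linux machine):

```shell
# list current SysV semaphore arrays; RELION's GPU code creates one per device
ipcs -s
# to keep an eye on them while the job runs, refresh periodically:
# watch -n 30 ipcs -s
```

If the arrays are present while the job runs but have disappeared after you log out and back in, something on the machine is destroying them mid-job.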

@biochem-fan this was not an issue in RELION-4 because this code was skipped: the #ifdef CUDAs in src/projector.cpp were missed in f453d2c137916c4115d3819dc1725bff775e1731. It came back in RELION-5 when they were changed to #ifdef _CUDA_ENABLED in 38f0c4f11629eb5add2b81a11976c239101e9480.

DrJesseHansen commented 3 months ago

Thanks for the response. #738 suggests setting the coarse search option to Yes? I had that turned off; I have set it to on and am re-running the job. I'll update my post when I know whether it worked.

Regarding the logging in/out: yes, sort of. I have a GPU node reserved which I log into using TurboVNC, so it is a remote-access node that runs constantly. The session runs continuously and I log in from home/work to check on my job; however, to regain access each time I must SSH directly into the node and reset my password. When the job ran on the cluster I don't recall whether I logged into the node, but I doubt it. I'll try running it again that way and make sure I do not log in to that node.

I ran ipcs -s; the results are below. I am running on two GPUs. Should I run this again if/when the job fails?

------ Semaphore Arrays --------
key         semid   owner     perms   nsems
0x8ba79abb  2       jhansen   666     1
0x8ba79aba  3       jhansen   666     1

huwjenkins commented 3 months ago

Should I run this again if/when the job fails?

You should try running this when you "login from home/work to check on my job". In the case I saw in #738 what I did was:

1) submit job from the workstation with a delayed start (via SLURM).
2) log out of all sessions on the workstation.
3) log back in once the job started and verify the semaphores were present.
4) log out and log back in and see that the semaphores were destroyed.
5) wait for the job to crash.
6) repeat 1 and 2 but never log in whilst the job was running; the result was successful completion of the job.

The workaround was to use screen to keep a session open on the workstation. But I'm surprised this is also happening on a cluster because you're unlikely to ever log into the node where the job is running.

huwjenkins commented 3 months ago

If you have admin rights you could also test if adding RemoveIPC=no in logind.conf and restarting logind fixes it. See https://github.com/systemd/systemd/issues/2039#issuecomment-279235692
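
As a config sketch (the exact file location may differ by distro; a drop-in under /etc/systemd/logind.conf.d/ also works):

```
# /etc/systemd/logind.conf
[Login]
RemoveIPC=no
```

followed by `sudo systemctl restart systemd-logind`. Note that restarting logind can disrupt active graphical sessions, so do this from a console or when nobody is logged in.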

DrJesseHansen commented 3 months ago

Update: it made it to the second iteration! So adding the coarse search option made a difference. Nice.

DrJesseHansen commented 3 months ago

Update: this just happened again with the same dataset, this time during ab initio reconstruction. It got to iteration 113 and then crashed; see below.

I definitely did NOT log into the node while it was processing this time.

Gradient optimisation iteration 112 of 200 with 3457 particles (Step size 0.5)
50.52/50.52 min ............................................................~~(,_,">
 Maximization...
   0/   0 sec ............................................................~~(,_,">
 CurrentResolution= 23.2758 Angstroms, which requires orientationSampling of at least 6.66667 degrees for a particle of diameter 400 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 7483392
 OrientationalSampling= 7.5 NrOrientations= 36864
 TranslationalSampling= 3 NrTranslations= 203
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 478937088
 OrientationalSampling= 3.75 NrOrientations= 294912
 TranslationalSampling= 1.5 NrTranslations= 1624
=============================
 Gradient optimisation iteration 113 of 200 with 3518 particles (Step size 0.5)
52.13/52.13 min ............................................................~~(,_,">
 Maximization...
   0/   0 sec ............................................................~~(,_,">
in: /dev/shm/src-relion-5-beta6-KaMZkjUz/relion/src/projector.cpp, line 208
ERROR: 
semop lock error
=== Backtrace  ===
relion_refine(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x6a) [0x56442ac3fd2a]
relion_refine(+0x7050c) [0x56442abb350c]
relion_refine(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x81b) [0x56442ae8873b]
relion_refine(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x56442ac696ac]
relion_refine(_ZN11MlOptimiser11expectationEv+0x21) [0x56442acae421]
relion_refine(_ZN11MlOptimiser7iterateEv+0x86) [0x56442acbfc26]
relion_refine(main+0x3c) [0x56442ac2de4c]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x14c9249fe24a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x14c9249fe305]
relion_refine(_start+0x21) [0x56442ac31671]
==================
ERROR: 
semop lock error

 RELION version: 5.0-beta-3-commit-6331fe
 exiting with an error ...
DrJesseHansen commented 3 months ago

This is now happening for all of my refine jobs. At least I get to about iteration 10 before it crashes. Any idea what is causing this?