Open DrJesseHansen opened 2 months ago
Did you read and try suggestions in #738?
If this is the same issue as described in #738 this is caused by the OS destroying the semaphores. Are you logging out of the machine whilst RELION is running? And on the cluster did you log into and then log out of the node the job was running on?
ipcs -s is useful to diagnose what's going on.
@biochem-fan this was not an issue in RELION-4 because this code was omitted due to the #ifdef CUDA guards in src/projector.cpp being missed in f453d2c137916c4115d3819dc1725bff775e1731, but it came back in RELION-5 when they were changed to #ifdef _CUDA_ENABLED in 38f0c4f11629eb5add2b81a11976c239101e9480.
Thanks for the response. #738 suggests setting the coarse search option to yes? I had that turned off; I have set it to on and am re-running the job. I'll update my post when I know whether it worked.
Regarding the logging in/out: yes I am. Well, sort of. I have a GPU node reserved which I log into using TurboVNC, so it's a remote access node which runs constantly. The session is running constantly and I log in from home/work to check on my job; however, to regain access to the node each time I must SSH directly into the node and reset my password. When it was run on the cluster I don't recall whether I logged into the node, but I doubt it. I'll try running it again that way and ensure I do not log in to that node.
I ran ipcs -s, results below. I am running on two GPUs. Should I run this again if/when the job fails?
------ Semaphore Arrays --------
key semid owner perms nsems
0x8ba79abb 2 jhansen 666 1
0x8ba79aba 3 jhansen 666 1
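To compare listings like the one above taken at different times (for example before and after logging in), a small parser can help. The following is only a sketch: the function names `parse_ipcs_semaphores` and `missing_semids` are made up for illustration, and it assumes the five-column layout shown above (key, semid, owner, perms, nsems), which may differ between `ipcs` versions.

```python
def parse_ipcs_semaphores(text):
    """Parse the text output of `ipcs -s` into a list of dicts.

    Assumes data rows have exactly five whitespace-separated fields
    and start with a hexadecimal key such as 0x8ba79abb.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the banner, the column header and blank lines.
        if len(fields) == 5 and fields[0].startswith("0x"):
            key, semid, owner, perms, nsems = fields
            rows.append({
                "key": key,
                "semid": int(semid),
                "owner": owner,
                "perms": perms,
                "nsems": int(nsems),
            })
    return rows


def missing_semids(before, after):
    """Return semaphore IDs present in `before` but absent from `after`."""
    before_ids = {row["semid"] for row in before}
    after_ids = {row["semid"] for row in after}
    return sorted(before_ids - after_ids)
```

The input text could be captured with `subprocess.run(["ipcs", "-s"], capture_output=True, text=True).stdout` while the job is running and again after a login/logout cycle. A non-empty result from `missing_semids` would suggest that semaphores were removed in between.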
Should I run this again if/when the job fails?
You should try running this when you "login from home/work to check on my job". In the case I saw in #738 what I did was:
1) submit job from the workstation with a delayed start (via SLURM).
2) log out of all sessions on the workstation
3) log back in once the job started and verify the semaphores were present
4) log out and log back in and see that the semaphores were destroyed
5) wait for the job to crash
6) repeat 1 and 2 but never log in whilst the job was running - result was successful completion of the job.
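The failure in steps 3 to 5 can be reproduced in miniature, without RELION, by creating a System V semaphore, removing it, and then calling semop on it. The sketch below is Linux-only and is not RELION's code: it calls the C library through `ctypes`, the helper name `semop_after_removal` is hypothetical, and the numeric constants are the Linux values of `IPC_PRIVATE`, `IPC_CREAT` and `IPC_RMID`. Here the removal is done explicitly with `semctl`, standing in for the removal that the OS performs on logout.

```python
import ctypes
import errno

# Linux values of the System V IPC constants (assumption: Linux libc).
IPC_PRIVATE = 0
IPC_CREAT = 0o1000
IPC_RMID = 0


class Sembuf(ctypes.Structure):
    """Mirror of the C `struct sembuf` used by semop()."""
    _fields_ = [
        ("sem_num", ctypes.c_ushort),
        ("sem_op", ctypes.c_short),
        ("sem_flg", ctypes.c_short),
    ]


def semop_after_removal():
    """Create a semaphore, remove it, then try semop on the stale ID.

    Returns (ok_before, rc_after, errno_after), or None if the
    semaphore could not be created (e.g. SysV IPC is unavailable).
    """
    libc = ctypes.CDLL(None, use_errno=True)
    libc.semop.argtypes = [ctypes.c_int, ctypes.POINTER(Sembuf), ctypes.c_size_t]

    semid = libc.semget(IPC_PRIVATE, 1, IPC_CREAT | 0o600)
    if semid < 0:
        return None

    op = Sembuf(0, 1, 0)  # increment semaphore 0; never blocks
    ok_before = libc.semop(semid, ctypes.byref(op), 1) == 0

    # Simulate the OS destroying the semaphore behind the process's back.
    libc.semctl(semid, 0, IPC_RMID)

    ctypes.set_errno(0)
    rc_after = libc.semop(semid, ctypes.byref(op), 1)
    return ok_before, rc_after, ctypes.get_errno()
```

On a typical Linux system the first semop succeeds and the second returns -1 with errno set to EINVAL or EIDRM, which is the kind of failure that a long-running process would report as a lock error.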
The workaround was to use screen to keep a session open on the workstation. But I'm surprised this is also happening on a cluster because you're unlikely to ever log into the node where the job is running.
If you have admin rights you could also test if adding RemoveIPC=no in logind.conf and restarting logind fixes it. See https://github.com/systemd/systemd/issues/2039#issuecomment-279235692
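Before changing anything, it may help to check whether RemoveIPC is already set explicitly. A minimal sketch, assuming the setting lives in the [Login] section of a file such as /etc/systemd/logind.conf; the function name `remove_ipc_setting` is hypothetical, and drop-in files (for example under a logind.conf.d directory) are not considered here even though they can override the main file.

```python
import configparser


def remove_ipc_setting(conf_text):
    """Return the explicit RemoveIPC value from logind.conf text.

    Returns None when the option is absent or commented out, in which
    case systemd's built-in default applies.
    """
    # strict=False tolerates repeated keys; interpolation is disabled
    # so that '%' characters in values are taken literally.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    return parser.get("Login", "RemoveIPC", fallback=None)
```

Usage could look like `remove_ipc_setting(open("/etc/systemd/logind.conf").read())`. A return value of `"no"` means the workaround is already in the main file; `None` means the option is not set there.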
update: it made it to the second iteration! So adding in coarse search option made a difference. Nice.
update. this just happened again. Same dataset, this time during ab initio. It got to iteration 113 then crashed. See below.
I definitely did NOT log into the node this time as it was processing.
Gradient optimisation iteration 112 of 200 with 3457 particles (Step size 0.5)
50.52/50.52 min ............................................................~~(,_,">
Maximization...
0/ 0 sec ............................................................~~(,_,">
CurrentResolution= 23.2758 Angstroms, which requires orientationSampling of at least 6.66667 degrees for a particle of diameter 400 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 7483392
OrientationalSampling= 7.5 NrOrientations= 36864
TranslationalSampling= 3 NrTranslations= 203
=============================
Oversampling= 1 NrHiddenVariableSamplingPoints= 478937088
OrientationalSampling= 3.75 NrOrientations= 294912
TranslationalSampling= 1.5 NrTranslations= 1624
=============================
Gradient optimisation iteration 113 of 200 with 3518 particles (Step size 0.5)
52.13/52.13 min ............................................................~~(,_,">
Maximization...
0/ 0 sec ............................................................~~(,_,">
in: /dev/shm/src-relion-5-beta6-KaMZkjUz/relion/src/projector.cpp, line 208
ERROR:
semop lock error
=== Backtrace ===
relion_refine(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x6a) [0x56442ac3fd2a]
relion_refine(+0x7050c) [0x56442abb350c]
relion_refine(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x81b) [0x56442ae8873b]
relion_refine(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x56442ac696ac]
relion_refine(_ZN11MlOptimiser11expectationEv+0x21) [0x56442acae421]
relion_refine(_ZN11MlOptimiser7iterateEv+0x86) [0x56442acbfc26]
relion_refine(main+0x3c) [0x56442ac2de4c]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a) [0x14c9249fe24a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x14c9249fe305]
relion_refine(_start+0x21) [0x56442ac31671]
==================
ERROR:
semop lock error
RELION version: 5.0-beta-3-commit-6331fe
exiting with an error ...
this is happening now for all of my refine jobs. At least I can get to about iteration 10 before it crashes. Any idea what is causing this?
I am running this command interactively on a GPU node with two 2080Ti cards. The same error occurs when submitting to the SLURM cluster on our HPC.
running Relion 5 beta 3 commit 6331fe
command:
mpirun --np 5 --oversubscribe relion_refine_mpi --o Class3D/job055/run --ios Extract/job025/optimisation_set.star --gpu "" --ref InitialModel/box40_bin8_invert.mrc --firstiter_cc --trust_ref_size --ini_high 60 --dont_combine_weights_via_disc --pool 3 --pad 2 --ctf --iter 25 --tau2_fudge 1 --particle_diameter 440 --K 1 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale --j 1 --pipeline_control Class3D/job055/
error: