3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

relion_tomo_reconstruct_tomogram_mpi requiring more memory than before and then crashing #1151

Open rdrighetto opened 2 months ago

rdrighetto commented 2 months ago

Describe your problem

Reconstructing tomograms, which worked fine in a previous version (RELION-5.0-beta-3-commit-63311fe), now seems to be broken. The first issue I see is that it now requires much more RAM: a job that I could previously run easily with 128 GB (perhaps more than it needed) now needs 400 GB, otherwise it runs into OOM errors. Once enough memory is available, it fails with this error:

TomoBackprojectProgram::getCtfCorrectedSNR  BUG: invalid access of newSNR array...

Please see full error message below. I'm trying to re-run something that worked fine in a previous version. I saw there were changes to reconstruct_tomogram.cpp (https://github.com/3dem/relion/commit/c3edb970ccfae9fe13f62f30eadd2497e740c51b) and wanted to compare the results before and after.

Environment:

Dataset:

Job options:

Error message: UPDATE: I edited the error message below to reflect the actual job corresponding to the binned tomogram dimensions above. The error is the same though.

in: /scicore/projects/scicore-p-structsoft/ubuntu/software/RELION/ver5.0/src/jaz/tomography/programs/reconstruct_tomogram.cpp, line 310
ERROR: 
TomoBackprojectProgram::getCtfCorrectedSNR  BUG: invalid access of newSNR array...
terminate called after throwing an instance of 'RelionError'
[scb05:972747] *** Process received signal ***
[scb05:972747] Signal: Aborted (6)
[scb05:972747] Signal code:  (-6)
[scb05:972747] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x14cdd210a520]
[scb05:972747] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x14cdd215e9fc]
[scb05:972747] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x14cdd210a476]
[scb05:972747] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x14cdd20f07f3]
[scb05:972747] [ 4] /scicore/soft/easybuild/apps/GCCcore/12.3.0/lib64/libstdc++.so.6(+0xa9a49)[0x14cdd2481a49]
[scb05:972747] [ 5] /scicore/soft/easybuild/apps/GCCcore/12.3.0/lib64/libstdc++.so.6(+0xb4e6a)[0x14cdd248ce6a]
[scb05:972747] [ 6] /scicore/soft/easybuild/apps/GCCcore/12.3.0/lib64/libstdc++.so.6(+0xb3ed9)[0x14cdd248bed9]
[scb05:972747] [ 7] /scicore/soft/easybuild/apps/GCCcore/12.3.0/lib64/libstdc++.so.6(__gxx_personality_v0+0x86)[0x14cdd248c5f6]
[scb05:972747] [ 8] /scicore/soft/easybuild/apps/GCCcore/12.3.0/lib64/libgcc_s.so.1(+0x17864)[0x14cde3945864]
[scb05:972747] [ 9] /scicore/soft/easybuild/apps/GCCcore/12.3.0/lib64/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x14cde39462bd]
[scb05:972747] [10] /scicore/projects/scicore-p-structsoft/ubuntu/software/RELION/ver5.0/build_amdfftw/bin/relion_tomo_reconstruct_tomogram_mpi[0x44b974]
[scb05:972747] [11] /scicore/projects/scicore-p-structsoft/ubuntu/software/RELION/ver5.0/build_amdfftw/bin/relion_tomo_reconstruct_tomogram_mpi[0x4feef6]
[scb05:972747] [12] /scicore/soft/easybuild/apps/GCCcore/12.3.0/lib64/libgomp.so.1(+0x1e45e)[0x14cdd8a3145e]
[scb05:972747] [13] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x14cdd215cac3]
[scb05:972747] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x14cdd21ee850]
[scb05:972747] *** End of error message ***
Command terminated by signal 6
133.84user 42.36system 0:45.40elapsed 388%CPU (0avgtext+0avgdata 41107392maxresident)k
srun: error: scb05: task 4: Exited with exit code 134
16848inputs+0outputs (2716major+40074934minor)pagefaults 0swaps
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 442534 ON scb04 CANCELLED AT 2024-06-21T19:05:10 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 442534.0 ON scb04 CANCELLED AT 2024-06-21T19:05:10 DUE TO TIME LIMIT ***
srun: got SIGCONT
srun: forcing job termination
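
Since the report is about memory use, the `maxresident` field in the timing line of the log above can be turned into a per-task peak memory figure. The following is a minimal sketch, assuming the line comes from GNU time's default summary, where the value is given in kilobytes; `peak_rss_gib` is a hypothetical helper name and not part of RELION. Note that the value reflects the peak of the task up to the point where it aborted, not of a completed run.

```python
import re

# Assumption: the log line is GNU time's default summary, which reports the
# peak resident set size as "<N>maxresident)k" with N in kilobytes.
_MAXRSS = re.compile(r"(\d+)maxresident\)k")


def peak_rss_gib(log_text):
    """Return the peak RSS in GiB for every timing summary found in log_text."""
    return [int(kb) / 1024**2 for kb in _MAXRSS.findall(log_text)]


# Timing line copied from the log above.
line = ("133.84user 42.36system 0:45.40elapsed 388%CPU "
        "(0avgtext+0avgdata 41107392maxresident)k")

print([round(gib, 1) for gib in peak_rss_gib(line)])  # prints [39.2]
```

Running the same function over the full job log for both commits would give one value per timed task, which makes the before/after memory comparison concrete.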

rdrighetto commented 2 months ago

Just confirmed that relion_tomo_reconstruct_tomogram_mpi works again when reverting to 6331fe600cca7683ecf7c1011ce676701faf1e97 with exactly the same settings.

rdrighetto commented 1 month ago

Another update: using the latest commit 96f798eb8a7eafa7e6d27c8b32711aad701cdfa2 on branch ver5.0, relion_tomo_reconstruct_tomogram_mpi works as before if I answer NO to "Fourier inversion with odd/even frames?", but crashes with the error above if I answer YES. I'm very curious to see what the tomograms look like with this odd/even-based SNR estimation...