gersgro commented 3 years ago

The problem: We are running 3D refinement on our polished particles with RELION 3.1, so we are at (or near) the end of our processing workflow, but the job does not finish: it exits with an allocation error.

We first run the job on GPUs to speed it up. It runs fine until around iteration 20, when an "out of memory" error appears. This is expected, because the box size is large (850 pixels), and RELION even warns about it at the start of the run (see the error log below). We have 156 GB of RAM, 4 TB of swap on a hard disk and 2 GeForce GTX 1080 Ti cards (11 GB of memory each). So, after selecting the last good iteration, we continue without GPUs. The job then runs until a new error appears, this time "Allocate: No space left"; the full error output is below.

We do not know how to solve this, because the data are the same as in earlier, successful 3D refinements, i.e. in "firstrefine" and after improvements such as "defocus" or "anisotropy mag correction". In those refinements the only error we saw was the GPU "out of memory" one, which we solved by not using GPUs in the subsequent iterations. It seems that the data are more memory-hungry after polishing (?), or maybe something is missing or wrong in our configuration (?).

We cannot continue processing until this is solved. We would like to know the origin of this error and, if possible, how to fix it. More information about our machine and configuration follows. Thanks a lot in advance for the answer!

Environment:
- OS: Ubuntu 18.04.5 LTS
- MPI runtime: OpenMPI 2.1.1
- RELION version: RELION-3.1.0-commit-6f0d9e
- Memory: 156 GB of RAM, 4 TB of swap on a hard disk
- CPU: 2 Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz (10 cores each, 40 threads in total)
- GPU: 2 GeForce GTX 1080Ti

Dataset:
- Box size: 850 px
- Pixel size: 1.30 Å/px
- Number of particles: 5,223
- Description: An icosahedral virus, refined with I symmetry imposed

Job options:
- Type of job: Refine3D imposing I symmetry
- Number of MPI processes: 3
- Number of threads: 10
- Full command: PLEASE NOTE THAT SENSITIVE DATA WAS CHANGED TO "xxxx". WE ARE SHOWING BOTH THE "FIRST" AND THE "CONTINUE" COMMANDS.

++++ Executing new job on Sun Nov 22 13:46:31 2020 ++++ with the following command(s): which relion_refine_mpi --o Refine3D/job042/run --auto_refine --split_random_halves --i Polish/job040/shiny.star --ref Class3D/job020/run_it030_class002_box850.mrc --firstiter_cc --ini_high 60 --dont_combine_weights_via_disc --scratch_dir /home/xxxx/relion_tmp/ --pool 2 --pad 2 --skip_gridding --ctf --ctf_corrected_ref --particle_diameter 900 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 3 --auto_local_healpix_order 5 --offset_range 5 --offset_step 2 --sym I --low_resol_join_halves 40 --norm --scale --j 10 --gpu "" --pipeline_control Refine3D/job042/ ++++

++++ Executing new job on Mon Nov 23 06:54:46 2020 ++++ with the following command(s): which relion_refine_mpi --continue Refine3D/job042/run_it020_optimiser.star --o Refine3D/job042/run_ct20 --dont_combine_weights_via_disc --scratch_dir /home/xxxx/relion_tmp/ --pool 2 --pad 2 --skip_gridding --particle_diameter 900 --j 10 --pipeline_control Refine3D/job042/ ++++

Error message: PLEASE NOTE THAT THE ERROR THAT WE ARE INTERESTED IN SOLVING IS THE LAST ONE, RELATED TO ALLOCATION.

                     ***WARNING***

With the current settings and hardware, you will be able to use an estimated image-size of 511 pixels during the last iteration...

...but your input box-size (image_size) is however 850. This means that you will likely run out of memory on the GPU(s), and will have to then re-start from the last completed iteration (i.e. continue from it) without the use of GPUs.

ERROR: out of memory in /opt/relion31/relion/src/acc/acc_backprojector_impl.h at line 38 (error-code 2) in: /opt/relion31/relion/src/acc/cuda/cuda_settings.h, line 67 ERROR:

A GPU-function failed to execute.

If this occured at the start of a run, you might have GPUs which are incompatible with either the data or your installation of relion. If you

-> INSTALLED RELION YOURSELF: if you e.g. specified -DCUDA_ARCH=50
   and are trying ot run on a compute 3.5 GPU (-DCUDA_ARCH=3.5), 
   this may happen.

-> HAVE MULTIPLE GPUS OF DIFFERNT VERSIONS: relion needs GPUS with
   at least compute 3.5. You may be trying to use a GPU older than
   this. If you have multiple generations, try specifying --gpu <X>
   with X=0. Then try X=1 in a new run, and so on. The numbering of
   GPUs may not be obvious from the driver or intuition. For a list
   of GPU compute generations, see 

   en.wikipedia.org/wiki/CUDA#Version_features_and_specifications

-> ARE USING DOUBLE-PRECISION GPU CODE: relion was been written so
   as to not require this, and may thus have unforeseen requirements
   when run in this mode. If you think it is nonetheless necessary,
   please consult the developers with this error.

If this occurred at the middle or end of a run, it might be that

-> YOUR DATA OR PARAMETERS WERE UNEXPECTED: execution on GPUs is 
   subject to many restrictions, and relion is written to work within
   common restraints. If you have exotic data or settings, unexpected
   configurations may occur. See also above point regarding 
   double precision.

If none of the above applies, please report the error to the relion developers at github.com/3dem/relion/issues

=== Backtrace === /opt/relion31/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x77) [0x5598d25659e7] /opt/relion31/bin/relion_refine_mpi(_ZN16AccBackprojector9setMdlDimEiiiiiif+0x268) [0x5598d277ac88] /opt/relion31/bin/relion_refine_mpi(_ZN14MlDeviceBundle22setupFixedSizedObjectsEv+0x275) [0x5598d2793985] /opt/relion31/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x309d) [0x5598d258622d] /opt/relion31/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xfb) [0x5598d25985fb] /opt/relion31/bin/relion_refine_mpi(main+0x73) [0x5598d2550103] /lib/x86_64-linux-gnu/libc.so.6(libc_start_main+0xe7) [0x7f2524049bf7] /opt/relion31/bin/relion_refine_mpi(_start+0x2a) [0x5598d255320a]

ERROR: out of memory in /opt/relion31/relion/src/acc/acc_backprojector_impl.h at line 38 (error-code 2) in: /opt/relion31/relion/src/acc/cuda/cuda_settings.h, line 67 ERROR:

A GPU-function failed to execute.

If this occured at the start of a run, you might have GPUs which are incompatible with either the data or your installation of relion. If you

-> INSTALLED RELION YOURSELF: if you e.g. specified -DCUDA_ARCH=50
   and are trying ot run on a compute 3.5 GPU (-DCUDA_ARCH=3.5), 
   this may happen.

-> HAVE MULTIPLE GPUS OF DIFFERNT VERSIONS: relion needs GPUS with
   at least compute 3.5. You may be trying to use a GPU older than
   this. If you have multiple generations, try specifying --gpu <X>
   with X=0. Then try X=1 in a new run, and so on. The numbering of
   GPUs may not be obvious from the driver or intuition. For a list
   of GPU compute generations, see 

   en.wikipedia.org/wiki/CUDA#Version_features_and_specifications

-> ARE USING DOUBLE-PRECISION GPU CODE: relion was been written so
   as to not require this, and may thus have unforeseen requirements
   when run in this mode. If you think it is nonetheless necessary,
   please consult the developers with this error.

If this occurred at the middle or end of a run, it might be that

-> YOUR DATA OR PARAMETERS WERE UNEXPECTED: execution on GPUs is 
   subject to many restrictions, and relion is written to work within
   common restraints. If you have exotic data or settings, unexpected
   configurations may occur. See also above point regarding 
   double precision.

If none of the above applies, please report the error to the relion developers at github.com/3dem/relion/issues

=== Backtrace === /opt/relion31/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x77) [0x55976b6c19e7] /opt/relion31/bin/relion_refine_mpi(_ZN16AccBackprojector9setMdlDimEiiiiiif+0x268) [0x55976b8d6c88] /opt/relion31/bin/relion_refine_mpi(_ZN14MlDeviceBundle22setupFixedSizedObjectsEv+0x275) [0x55976b8ef985] /opt/relion31/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x309d) [0x55976b6e222d] /opt/relion31/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xfb) [0x55976b6f45fb] /opt/relion31/bin/relion_refine_mpi(main+0x73) [0x55976b6ac103] /lib/x86_64-linux-gnu/libc.so.6(libc_start_main+0xe7) [0x7f59c913cbf7] /opt/relion31/bin/relion_refine_mpi(_start+0x2a) [0x55976b6af20a]

in: /opt/relion31/relion/src/multidim_array.h, line 702 ERROR: Allocate: No space left === Backtrace === /opt/relion31/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x77) [0x55a5baded9e7] /opt/relion31/bin/relion_refine_mpi(_ZN13MultidimArrayIdE12coreAllocateEv+0x365) [0x55a5bae25b95] /opt/relion31/bin/relion_refine_mpi(_ZN13BackProjector11reconstructER13MultidimArrayIdEibRKS1_ddibP5ImageIdE+0x219) [0x55a5bae86969] /opt/relion31/bin/relion_refine_mpi(_ZN14MlOptimiserMpi46readTemporaryDataAndWeightArraysAndReconstructEii+0xf7d) [0x55a5bae1a2fd] /opt/relion31/bin/relion_refine_mpi(_ZN14MlOptimiserMpi12maximizationEv+0xf2d) [0x55a5bae1b6ed] /opt/relion31/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x458) [0x55a5bae20958] /opt/relion31/bin/relion_refine_mpi(main+0x73) [0x55a5badd8103] /lib/x86_64-linux-gnu/libc.so.6(libc_start_main+0xe7) [0x7fbeac819bf7] /opt/relion31/bin/relion_refine_mpi(_start+0x2a) [0x55a5baddb20a]

ERROR: Allocate: No space left

biochem-fan commented 3 years ago

Are the box size and the pixel size the same before and after Polish? How about resolution and angular sampling? If the resolution improved and/or Refine3D continued to finer angular sampling after Polish, you need more memory than before.

The first thing I would try is to use Skip padding: Yes. This creates artifacts at the corners of the box, but as long as you can mask them out, you don't have to worry too much.
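
To get a feel for why padding matters at this box size, here is a rough back-of-the-envelope sketch. It assumes a single dense cubic array of double-precision values with side `pad * box`. That is only an illustration, not RELION's actual allocation pattern, which involves several arrays of different shapes. The helper name `cube_bytes` is made up for this example, and it assumes that Skip padding: Yes corresponds to a padding factor of 1.

```python
def cube_bytes(box_px: int, pad: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for one dense cube of side pad*box.

    Assumption: 8-byte (double precision) values, one array only.
    """
    side = box_px * pad
    return side ** 3 * bytes_per_value


padded = cube_bytes(850, pad=2)    # padding factor 2, as in the commands above
unpadded = cube_bytes(850, pad=1)  # assumed equivalent of "Skip padding: Yes"

print(f"pad 2: {padded / 1e9:.1f} GB per array")
print(f"pad 1: {unpadded / 1e9:.1f} GB per array")
print(f"ratio: {padded // unpadded}x")
```

Under these assumptions a single padded array is already tens of GB, and each MPI process that reconstructs needs its own copies, so an 8x reduction per array can make a real difference.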

gersgro commented 3 years ago

Hi biochem-fan, thanks for your reply, let me answer point by point.

Are the box size and the pixel size the same before and after Polish?

Yes. The original pixel size was 1.1075, but after re-sizing the box we got 1.3029, which was kept throughout the processing. In fact, we always entered -1 wherever a pixel size was requested in the subsequent steps, to avoid mistakes.

How about resolution and angular sampling? If the resolution improved and/or Refine3D continued to finer angular sampling after Polish, you need more memory than before.

The following values correspond to the last iteration (the one considered as converged) for refining using pre- and post-polished particles:

Pre:
Auto-refine: Iteration= 34
Auto-refine: Resolution= 4.6928 (no gain for 1 iter)
Auto-refine: Changes in angles= 0.0319699 degrees; and in offsets= 0.101337 Angstroms (no gain for 4 iter)

Post:
Auto-refine: Iteration= 34
Auto-refine: Resolution= 4.57645 (no gain for 2 iter)
Auto-refine: Changes in angles= 0.0315159 degrees; and in offsets= 0.0881933 Angstroms (no gain for 2 iter)

So, both resolution and angular sampling improved, but in our opinion not by enough to cause this problem, right? In any case, we do not expect a big improvement from the shiny particles; we just want to solve the issue.

Something I did not mention before is that we previously tried the same refinement parameters with a mask (which is optional) to mask out the solvent and improve the results, with "Use solvent-flattened FSC?" set to Yes. In that case we did not even reach the 14th iteration: the job aborted with the same error, "Allocate: No space left". In the run.out file, the output right before the crash looks like this:

RELION version: 3.1.0-commit-6f0d9e exiting with an error ...

MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.

Interestingly, the error can appear at any point within an iteration, i.e. we have seen it during both "expectation" and "maximization". So, in our opinion, everything points to a problem in the machine configuration: at some point, for some reason, RELION cannot allocate memory and exits. Should we check for restrictions or limits on memory usage? Remember we have 156 Gb RAM and 4 Tb Swap, so there should not be any problem, right?

The first thing I would try is to use Skip padding: Yes. This creates artifacts at the corners of the box, but as long as you can mask them out, you don't have to worry too much.

OK, we will try this and report back.

Thanks again!

biochem-fan commented 3 years ago

The following values correspond to the last iteration (the one considered as converged) for refining using pre- and post-polished particles:

You have to check OrientationalSampling=, not the estimated accuracy.

Remember we have 156 Gb RAM and 4 Tb Swap, so there should not be any problem, right?

Having 4 TB of swap does NOT guarantee a program can allocate 4 TB of memory.
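
As a starting point for checking this on a Linux machine, the sketch below prints the per-process address-space limit (what `ulimit -v` controls) and the kernel's overcommit settings. It is a minimal illustration, not a RELION tool: the helper names `parse_meminfo_kb` and `fmt_limit` are made up here, and it assumes a standard Linux `/proc` layout. Note that `CommitLimit` is only strictly enforced when `vm.overcommit_memory` is 2; in the default heuristic mode the kernel can still refuse allocations it considers obviously too large.

```python
import resource


def parse_meminfo_kb(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def fmt_limit(value: int) -> str:
    """Render an rlimit value given in bytes."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / 1e9:.1f} GB"


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    print("address-space limit: soft", fmt_limit(soft), "/ hard", fmt_limit(hard))
    try:
        with open("/proc/sys/vm/overcommit_memory") as fh:
            print("vm.overcommit_memory =", fh.read().strip())
        with open("/proc/meminfo") as fh:
            info = parse_meminfo_kb(fh.read())
        for key in ("MemTotal", "SwapTotal", "CommitLimit", "Committed_AS"):
            # kB / 1e6 gives an approximate figure in GB
            print(f"{key}: {info.get(key, 0) / 1e6:.1f} GB")
    except OSError:
        print("/proc is not available on this system")
```

The limits should be read in the same shell or job environment that launches `relion_refine_mpi`, since they are inherited by child processes and can differ between an interactive shell and a queueing system.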

You should check how much memory relion_refine_mpi is actually using by top.
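
If watching top by hand is inconvenient, the same per-process numbers can be read from `/proc`. The sketch below is one possible way to do it on Linux, with hypothetical helper names (`parse_status_kb`, `relion_memory`): it looks for processes whose command name matches `relion_refine_mpi` and reports resident memory (`VmRSS`), swapped-out memory (`VmSwap`) and peak virtual size (`VmPeak`). The kernel truncates the name in `/proc/<pid>/comm` to 15 characters, which is why the comparison uses a truncated name.

```python
from pathlib import Path


def parse_status_kb(status_text: str, keys=("VmRSS", "VmSwap", "VmPeak")) -> dict:
    """Extract selected Vm* fields (in kB) from /proc/<pid>/status text."""
    out = {}
    for line in status_text.splitlines():
        name, _, rest = line.partition(":")
        if name in keys:
            out[name] = int(rest.split()[0])  # value looks like '<n> kB'
    return out


def relion_memory(process_name: str = "relion_refine_mpi") -> dict:
    """Map pid -> memory fields for every running process with this name."""
    usage = {}
    proc_root = Path("/proc")
    if not proc_root.is_dir():
        return usage  # not a Linux-style system
    for proc in proc_root.iterdir():
        if not proc.name.isdigit():
            continue
        try:
            # comm holds at most 15 characters of the executable name
            if (proc / "comm").read_text().strip() != process_name[:15]:
                continue
            usage[int(proc.name)] = parse_status_kb((proc / "status").read_text())
        except OSError:
            continue  # process exited or is not readable
    return usage


if __name__ == "__main__":
    for pid, fields in sorted(relion_memory().items()):
        # kB / 1e6 gives an approximate figure in GB
        print(pid, {k: f"{v / 1e6:.1f} GB" for k, v in fields.items()})
```

Running this in a loop while the job is in the maximization step would show which MPI rank grows just before the allocation fails.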

gersgro commented 3 years ago

You have to check OrientationalSampling= , not the estimated accuracy.

Oops, you are right here. Next are the values, from the pre-polishing refinement (which ended OK) and the post-polishing refinement (presenting problems). I still have the same feeling as before:

Pre-polishing
CurrentResolution= 4.6928 Angstroms, which requires orientationSampling of at least 0.597015 degrees for a particle of diameter 900 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 1305
OrientationalSampling= 0.0585938 NrOrientations= 145
TranslationalSampling= 0.390882 NrTranslations= 9
Oversampling= 1 NrHiddenVariableSamplingPoints= 41760
OrientationalSampling= 0.0292969 NrOrientations= 1160
TranslationalSampling= 0.195441 NrTranslations= 36

Post-polishing
CurrentResolution= 4.57645 Angstroms, which requires orientationSampling of at least 0.582524 degrees for a particle of diameter 900 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 1305
OrientationalSampling= 0.0585938 NrOrientations= 145
TranslationalSampling= 0.390882 NrTranslations= 9
Oversampling= 1 NrHiddenVariableSamplingPoints= 41760
OrientationalSampling= 0.0292969 NrOrientations= 1160
TranslationalSampling= 0.195441 NrTranslations= 36

Having 4 TB of swap does NOT guarantee a program can allocate 4 TB of memory.

Oh, so how can I confirm this? How can I find out the total amount of swap a program can allocate? Is there a simple way to check it and adjust it as needed?

You should check how much memory relion_refine_mpi is actually using by top.

I am actually running a script that records memory use with the free command every 2 seconds and then plots those values; top only shows the percentage of memory used (%MEM). The %MEM in top stays around 95% in total all the time (two MPI processes use ~44-46% each and the third ~1%), so I assume it represents the percentage of the total (RAM + swap) memory in use at any time. From free I can tell that the whole 155-156 GB of RAM is used during the refinement to Nyquist. During this step (where the program exits, as mentioned before) swap use grows from just a couple of GB at the beginning to over 200 GB, with some peaks of 250 GB or so. So, in our understanding, RELION is free to use as much swap as it needs.

Of course, our machine gets slow because almost all the RAM is consumed. We would also like to control this but do not know how yet, as every attempt so far only limits memory per process and not per user.

The amounts of memory mentioned above are not uncommon: we already saw them (and even peaks of up to 500 GB, mostly swap) when running Refine3D in earlier steps of the processing, and those jobs finished correctly; see the example above for refining the "non-polished" particles.

In summary, we do not understand why RELION exits prematurely claiming lack of memory. Is it possible that the problem is in OpenMPI? If somebody thinks so, we would appreciate sharing more information.

Thanks,

biochem-fan commented 3 years ago

I am sure you don't need terabytes of memory to refine an 850 px box. Something is weird, but it is almost impossible to diagnose this sort of issue without access to the machine and the dataset.
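
One small sanity check on the figures quoted earlier in the thread: top's %MEM column is a task's resident share of physical RAM and does not include swap. Under that reading, the reported percentages can be converted to GB as sketched below. The rank labels and the function name `resident_gb` are hypothetical, and the percentages are rounded from the values given above.

```python
def resident_gb(mem_percent: float, physical_ram_gb: float) -> float:
    """Convert top's %MEM (share of physical RAM) into GB of resident memory."""
    return mem_percent / 100.0 * physical_ram_gb


ram_gb = 156  # physical RAM reported in this thread

# Hypothetical labels; the thread only says two processes use ~44-46% and one ~1%.
for label, pct in (("process A", 45.0), ("process B", 45.0), ("process C", 1.0)):
    print(f"{label}: about {resident_gb(pct, ram_gb):.1f} GB resident")
```

This suggests roughly 70 GB resident for each of the two working processes, with anything beyond physical RAM showing up as swap rather than in %MEM.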

biochem-fan commented 3 years ago

Closing an old issue without responses.

3dem / relion

"Allocate: No space left" while 3D refining polished particles #715
