Auto-refine jobs stall or crash

daniel-s-d-larsson commented 4 years ago

I try to run auto-refine jobs, but they either crash or stall on the first expectation step (100% CPU utilization, 0% GPU, nothing written to run.out). When a job crashes, the error is either a segmentation fault or a custom CUDA allocator error.

Things I tested:

Recompiled yesterday, in case there was a recent bug fix.
Ran with --maxsig 1000, in case the GPU memory is too small for difficult particles (Example 2).
Re-ran an auto-refine job with the exact same parameters that worked a few weeks ago; it now crashes.

Could this be a hardware problem?

Environment:

OS: Ubuntu 18.04 LTS
MPI runtime: OpenMPI 2.1.1
RELION version: 3.1-beta-commit-c17c89
Memory: 64 GB
GPU: 2 x GTX 1080Ti

Dataset:

Box size: 512 px
Pixel size: 0.82 Å/px
Number of particles: 364,000
Description: ribosome

Job options:

Type of job: Refine3D
Number of MPI processes: 3
Number of threads: 16

Example 1 Run command: relion_refine_mpi --o Refine3D/job124/run --auto_refine --split_random_halves --i Select/job120/particles.star --ref Refine3D/job112/run_class001.mrc --ini_high 30 --dont_combine_weights_via_disc --no_parallel_disc_io --scratch_dir /scratch/relion1 --pool 100 --pad 1 --ctf --ctf_corrected_ref --particle_diameter 310 --flatten_solvent --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 16 --gpu "" --maxsig 1000 --pipeline_control Refine3D/job124/

Error message (run.out):

Expectation iteration 1
2.25/20.98 min ......~~(,_,">[klug:76463] *** Process received signal ***     [oo]
(1048576B) (1052672B) (1048576B) (1052672B) (1048576B) (1052672B) (1048576B) (1052672B) (1048576B) (1052672B) (1048576B) (1052672B) (1048576B) (1052672B) (1048576B) (1052672B) (1048576B) (1052672B) (1048576B) (1052672B) (1048576B) (1052672B) [7168B] (512B) [1536B] (1536B) [512B] (512B) [4608B] <512B> [512B] (512B) <1536B> [512B] (512B) <512B> (512B) (512B) [512B] (512B) <1536B> <512B> [512B] <512B> <1536B> (5632B) (1536B) [512B] <512B> <512B> [512B] (512B) (512B) [512B] <512B> (512B) <512B> [512B] (1536B) (512B) (512B) (512B) [512B] (512B) (512B) (512B) [512B] (3072B) [512B] (512B) (11776B) [512B] <512B> <512B> [512B] (3072B) (512B) (512B) [512B] (5632B) [512B] (512B) (1024B) [512B] <512B> <1536B> [512B] <512B> <512B> <512B> (512B) (4608B) (1536B) (1536B) [512B] (4608B) [1536B] (4608B) [1536B] (512B) [1024B] (512B) (512B) [1536B] (5632B) (5632B) (4608B) [3584B] (5632B) (3072B) [11264B] (1536B) [512B] <3072B> <5632B> (4608B) [3584B] (5632B) (5632B) [3584B] <5632B> <5632B> (3072B) (5632B) [1024B] <5632B> <10752[klug:76463] Signal: Segmentation fault (11)
[klug:76463] Signal code: Address not mapped (1)
[klug:76463] Failing at address: 0x40
B> [11264B] (18432B) [4096B] (5632B) (4608B) (5632B) (5632B) (5632B) [7680B] (5632B) (5632B) (18432B) [4096B] (5632B) [4608B] (5632B) [2048B] (10752B) (10752B) [5632B] (5632B) (5632B) (5632B) [4096B] (5632B) [3584B] (10752B) (5632B) [12800B] (5632B) [1536B] (10752B) (5632B) [18432B] (10752B) (21504B) (10752B) [512B] <5632B> [5632B] <5632B> <5632B> [7168B] <5632B> <5632B> (18432B) (18432B) [14848B] (5632B) (18432B) [17920B] <5632B> (128000B) [18432B] (5632B) (5632B) (10752B) (5632B) (10752B) <98304B> [31744B] (18432B) [5632B] <5632B> <5632B> <10752B> [79872B] <36864B> [42496B] (36864B) [229376B] (236544B) [281600B] (18432B) [240640B] (387072B) (128000B) [37888B] (473088B) [165888B] <165888B> <387072B> (387072B) (387072B) [65536B] (387072B) [309248B] (165888B) [221184B] (473088B) (387072B) [908288B] (473088B) (473088B) (1048576B) [2048B] (1048576B) (1052672B) (1048576B) [574976B] (387072B) [86528B] (1048576B) (1052672B) (946176B) (1048576B) (1048576B) [989184B] (387072B) [5017600B] (1048576B) (1052672B) [2097152B] (1048576B) (1052672B) [1048576B] (1048576B) (1052672B) [8177766144B] = 8235712256B
[klug:76463] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f1091f87890]
[klug:76463] [ 1] /usr/lib/x86_64-linux-gnu/libcuda.so.1(+0xe008d)[0x7f1073ef108d]
[klug:76463] [ 2] /usr/lib/x86_64-linux-gnu/libcuda.so.1(cuEventRecord_ptsz+0x5d)[0x7f10740289dd]
[klug:76463] [ 3] /usr/local/relion-3.1b_20200215/bin/relion_refine_mpi(+0x2c1f12)[0x557dbdfa9f12]
[klug:76463] [ 4] /usr/local/relion-3.1b_20200215/bin/relion_refine_mpi(+0x3079bb)[0x557dbdfef9bb]
[klug:76463] [ 5] /usr/local/relion-3.1b_20200215/bin/relion_refine_mpi(_Z17storeWeightedSumsI15MlOptimiserCudaEvR21OptimisationParamtersR18SamplingParametersP11MlOptimiserPT_RSt6vectorI16IndexedDataArraySaISA_EERS9_I16ProjectionParamsSaISE_EERS9_IS9_I20IndexedDataArrayMaskSaISI_EESaISK_EE13AccPtrFactoryiRS9_I12AccPtrBundleSaISP_EE+0x655c)[0x557dbdf83bcc]
[klug:76463] [ 6] /usr/local/relion-3.1b_20200215/bin/relion_refine_mpi(+0x2710ea)[0x557dbdf590ea]
[klug:76463] [ 7] /usr/local/relion-3.1b_20200215/bin/relion_refine_mpi(_ZN15MlOptimiserCuda32doThreadExpectationSomeParticlesEi+0xed)[0x557dbdf5b01d]
[klug:76463] [ 8] /usr/local/relion-3.1b_20200215/bin/relion_refine_mpi(_Z36globalThreadExpectationSomeParticlesR14ThreadArgument+0x3d)[0x557dbddeee0d]
[klug:76463] [ 9] /usr/local/relion-3.1b_20200215/bin/relion_refine_mpi(_Z11_threadMainPv+0x4e)[0x557dbdd6a8ce]
[klug:76463] [10] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f1091f7c6db]
[klug:76463] [11] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f1090d2b88f]
[klug:76463] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node klug exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Error message (run.err):

ERROR: unspecified launch failure in /home/larsson/src/relion/src/acc/cuda/custom_allocator.cuh at line 176 (error-code 4)
in: /home/larsson/src/relion/src/acc/cuda/cuda_settings.h, line 67
ERROR: 

A GPU-function failed to execute.

 If this occured at the start of a run, you might have GPUs which
are incompatible with either the data or your installation of relion.
If you 

    -> INSTALLED RELION YOURSELF: if you e.g. specified -DCUDA_ARCH=50
       and are trying ot run on a compute 3.5 GPU (-DCUDA_ARCH=3.5), 
       this may happen.

    -> HAVE MULTIPLE GPUS OF DIFFERNT VERSIONS: relion needs GPUS with
       at least compute 3.5. You may be trying to use a GPU older than
       this. If you have multiple generations, try specifying --gpu <X>
       with X=0. Then try X=1 in a new run, and so on. The numbering of
       GPUs may not be obvious from the driver or intuition. For a list
       of GPU compute generations, see 

       en.wikipedia.org/wiki/CUDA#Version_features_and_specifications

    -> ARE USING DOUBLE-PRECISION GPU CODE: relion was been written so
       as to not require this, and may thus have unforeseen requirements
       when run in this mode. If you think it is nonetheless necessary,
       please consult the developers with this error.

If this occurred at the middle or end of a run, it might be that

    -> YOUR DATA OR PARAMETERS WERE UNEXPECTED: execution on GPUs is 
       subject to many restrictions, and relion is written to work within
       common restraints. If you have exotic data or settings, unexpected
       configurations may occur. See also above point regarding 
       double precision.
If none of the above applies, please report the error to the relion
developers at    github.com/3dem/relion/issues

Example 2 Run command: relion_refine_mpi --o Refine3D/job125/run --auto_refine --split_random_halves --i CtfRefine/job116/particles_ctf_refine.star --ref Refine3D/job112/run_class001.mrc --ini_high 30 --dont_combine_weights_via_disc --no_parallel_disc_io --scratch_dir /scratch/relion1 --pool 100 --pad 1 --ctf --ctf_corrected_ref --particle_diameter 310 --flatten_solvent --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 16 --gpu "" --maxsig 1000 --pipeline_control Refine3D/job125/

Error message:

Expectation iteration 1
2.92/26.50 min ......~~(,_,">[klug:85942] *** Process received signal ***     [oo]
[klug:85942] Signal: Segmentation fault (11)
[klug:85942] Signal code: Address not mapped (1)
[klug:85942] Failing at address: 0x40
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node klug exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Example 3 Run command: relion_refine_mpi --o Refine3D/job127/run --auto_refine --split_random_halves --i CtfRefine/job116/particles_ctf_refine.star --ref Refine3D/job112/run_class001.mrc --ini_high 30 --dont_combine_weights_via_disc --no_parallel_disc_io --scratch_dir /scratch/relion1 --pool 100 --pad 1 --ctf --ctf_corrected_ref --particle_diameter 310 --flatten_solvent --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 16 --gpu "" --maxsig 1000 --pipeline_control Refine3D/job127/

Error message:

Expectation iteration 5
4.40/39.98 min ......~~(,_,">(1048576B) (1052672B) (1048576B) (1052672B) (1048576B) (1052672B) (1048576B) (1052672B) (1048576B) (1052672B) (1048576B) (1052672B) (512B) (512B) (512B) (1536B) (512B) <512B> <512B> [512B] (512B) (512B) (512B) (512B) (512B) (512B) [512B] (512B) (512B) (512B) [512B] (512B) (1536B) (512B) (512B) [512B] (512B) (1536B) [512B] (512B) (512B) (512B) (3072B) (3072B) (512B) (1536B) (1536B) [1024B] (512B) (3584B) [1536B] (512B) (1536B) (1536B) [9216B] (512B) [2560B] (512B) [1024B] (512B) [7680B] (512B) (16896B) [12800B] (512B) (16896B) [1024B] (4096B) <18432B> [9216B] (3072B) <18432B> [7680B] <18432B> [6144B] (5632B) [8192B] (4608B) [13312B] (5632B) (5632B) [512B] (5632B) (18432B) [512B] (3072B) (5632B) (5632B) [12800B] (36352B) (172032B) [24576B] (51200B) [15360B] (86016B) (10752B) (10752B) (10752B) (10752B) (21504B) (32256B) (64512B) (64512B) (64512B) (64512B) (98304B) [24576B] (10752B) (10752B) (10752B) (98304B) [58880B] (5632B) (5632B) (10752B) [42496B] (5632B) [29184B] (10752B) (21504B) (5632B) [5120B] (10752B) (172032B) [18432B] (129024B) (172032B) (172032B) (98304B) [12800B] (172032B) (172032B) (98304B) [43008B] (172032B) (172032B) [88576B] (172032B) (172032B) (172032B) [172032B] (102400B) (102400B) [41472B] (172032B) (162816B) [9216B] (102400B) (172032B) [140288B] (344064B) [172032B] (102400B) (172032B) (172032B) [172032B] (172032B) (172032B) (387072B) (172032B) (172032B) (1048576B) (1048576B) (387072B) (172032B) (172032B) [84480B] (204288B) (172032B) (172032B) (172032B) (172032B) [131072B] (1048576B) (1052672B) <1048576B> (172032B) (172032B) (172032B) (172032B) (172032B) (172032B) [16384B] (1048576B) (1052672B) (391680B) (172032B) (172032B) (172032B) [140800B] (1048576B) (1052672B) (1048576B) (1052672B) (172032B) (379392B) [497152B] (1048576B) (1052672B) [1048576B] (1048576B) (1052672B) [1048576B] (1048576B) (1052672B) (1048576B) (1048576B) [1048576B] (1048576B) (1052672B) (1048576B) [1048576B] (1048576B) (1052672B) [1048576B] (1048576B) 
(1052672B) [1048576B] (1048576B) (1048576B) (1048576B) (1048576B) (1048576B) (1048576B) [7900609280B] = 7965179648B
KERNEL_ERROR: unspecified launch failure in /home/larsson/src/relion/src/acc/utilities_impl.h at line 247 (error-code 4)
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node klug exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
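
The allocator dumps in Examples 1 and 3 are hard to read by eye. As a rough aid, the sketch below tallies the block sizes in such a dump and compares their sum with the total printed after the `=` sign. The function name `summarize_dump` is hypothetical, and the meaning of the three bracket styles `( )`, `[ ]` and `< >` is not stated in this thread, so the sketch only counts them separately without interpreting them.

```python
import re
from collections import Counter

# One block looks like "(512B)", "[1024B]" or "<512B>".
_BLOCK = re.compile(r"([(\[<])(\d+)B[)\]>]")
# The dump ends with "= <total>B".
_TOTAL = re.compile(r"=\s*(\d+)B")


def summarize_dump(dump: str) -> dict:
    """Tally block sizes per bracket style and compare with the printed total."""
    bytes_by_style = Counter()
    for opener, size in _BLOCK.findall(dump):
        bytes_by_style[opener] += int(size)

    total = _TOTAL.search(dump)
    return {
        "bytes_by_style": dict(bytes_by_style),
        "sum": sum(bytes_by_style.values()),
        "reported_total": int(total.group(1)) if total else None,
    }
```

In Example 1 the dump is interleaved with the signal handler's output (one block is split across lines), so a tally of that text would miss at least one block and the sum should not be expected to match the reported total exactly.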

dkimanius commented 4 years ago

All of the reported error messages point to an issue in the runtime environment. Have you recently done a driver/hardware upgrade?

If all your refinements are failing, then you have a very reproducible issue, which is good. You should go back to a RELION version that you know was working fine; you can use git checkout to switch to a major version commit. If you still get this error, then you'll know that something has changed in your environment that is causing it.

daniel-s-d-larsson commented 4 years ago

I recently patched my system, since I had a long backlog of updates. That was probably a bad idea and is likely the cause of these problems.

daniel-s-d-larsson commented 4 years ago

This may be GPU-related. I found issue https://github.com/3dem/relion/issues/436, which has quite similar error output. It turned out to be caused by RAM problems on the GPU. If this is hardware-related, is there a way to test the GPU memory? I'm currently running a job without GPUs and it has reached iteration 4 without any issues.

Checking /var/log/apt/history.log, it seems that the system upgraded the NVIDIA driver package nvidia-384-dev:amd64 from version 390.116-0ubuntu0.18.04.1 to version 390.116-0ubuntu0.18.04.3. From the version string, it doesn't seem to be a major update, although I'm not very familiar with these things. Should I try to revert to version 390.116-0ubuntu0.18.04.1? I'm not really sure how to proceed.
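
For anyone retracing this step, here is a minimal sketch of how the upgrade entries could be pulled out of the apt history log. It assumes the usual layout of that file, with `Start-Date:` lines followed by an `Upgrade:` line listing `package:arch (old, new)` pairs. The names `Upgrade` and `find_upgrades` are hypothetical helpers, not part of any tool mentioned in this thread.

```python
import re
from typing import List, NamedTuple


class Upgrade(NamedTuple):
    date: str
    package: str
    old_version: str
    new_version: str


# Matches one "package:arch (old, new)" entry on an "Upgrade:" line.
_ENTRY = re.compile(r"(\S+) \(([^,()]+), ([^()]+)\)")


def find_upgrades(log_text: str, keyword: str = "nvidia") -> List[Upgrade]:
    """Return upgrades whose package name contains `keyword` (case-insensitive)."""
    upgrades = []
    date = "unknown"
    for line in log_text.splitlines():
        if line.startswith("Start-Date:"):
            # Remember the timestamp of the transaction being read.
            date = line.split(":", 1)[1].strip()
        elif line.startswith("Upgrade:"):
            for pkg, old, new in _ENTRY.findall(line.split(":", 1)[1]):
                if keyword.lower() in pkg.lower():
                    upgrades.append(Upgrade(date, pkg, old, new))
    return upgrades
```

It could be called as `find_upgrades(open("/var/log/apt/history.log").read())`. Older entries may sit in rotated, compressed copies of the log, which this sketch does not read.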

daniel-s-d-larsson commented 4 years ago

Ok, now I ran https://github.com/ihaque/memtestG80 for 100 iterations on each of the two GTX 1080Ti cards without any errors, so hardware seems to be fine.

daniel-s-d-larsson commented 4 years ago

So, I tried reverting the Ubuntu package of the Nvidia drivers to the previous one, but to no avail.

Therefore I decided to run the hardware stress test for 1000 iterations and after about 120 iterations, one of the cards started to throw lots of errors and eventually completely locked up. So this was indeed a hardware problem.

dkimanius commented 4 years ago

Sounds like the issue is triggered by a temperature-induced hardware problem. Glad you found it.
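
To check whether temperature is involved, GPU temperatures can be logged while a stress test or a refinement is running. The sketch below reads them through `nvidia-smi --query-gpu`. The helper names and the 85 °C limit are assumptions chosen for illustration, not values taken from this thread or from NVIDIA's specifications.

```python
import subprocess
from typing import Dict, List

QUERY = "index,temperature.gpu,utilization.gpu,memory.used"


def read_nvidia_smi() -> str:
    """Run nvidia-smi once and return its CSV output (needs the NVIDIA driver)."""
    cmd = ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_readings(csv_text: str) -> List[Dict[str, int]]:
    """Parse lines such as '0, 64, 98, 7596' into one dict per GPU."""
    readings = []
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 4:
            continue
        try:
            index, temp, util, mem = (int(field) for field in fields)
        except ValueError:
            # Skip rows with non-numeric values such as "[N/A]".
            continue
        readings.append(
            {"gpu": index, "temp_c": temp, "util_pct": util, "mem_mib": mem}
        )
    return readings


def hot_gpus(readings: List[Dict[str, int]], limit_c: int = 85) -> List[int]:
    """Return the indices of GPUs at or above limit_c (an assumed threshold)."""
    return [r["gpu"] for r in readings if r["temp_c"] >= limit_c]
```

Calling `hot_gpus(parse_readings(read_nvidia_smi()))` in a loop every few seconds would show whether one card runs much hotter than the other shortly before the errors start.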

3dem / relion #577