Closed bforsbe closed 7 years ago
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
Could you repeat the same run as above with --pool 1 and check its behaviour?
Original comment by Robert McLeod (Bitbucket: robbmcleod, GitHub: robbmcleod):
Typically 100. These nodes have 512 GB of system memory so a lack of allocatable memory should not be an issue in our case.
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
Our current status to this problem is the following:
The problem appears to be more likely with more pooled particles.
If we set --pool 1, the problem no longer seems to occur and the 2D classification works fine; so far we have never encountered the problem when using --pool 1.
Before, we used values like --pool 50 for 2D classification, and the problem of a stalling GPU seemed to occur much more often.
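For reference, the pooling size is just another relion_refine_mpi flag; a minimal, hypothetical 2D-classification invocation with it set to 1 (star file, output directory, class count and diameter are placeholders, not values from this report) could look like:
#!bash
mpirun -n 5 `which relion_refine_mpi` \
    --i Particles/particles.star --o Class2D/run_pool1 \
    --ctf --iter 25 --K 100 --tau2_fudge 2 \
    --particle_diameter 200 --zero_mask \
    --j 4 --gpu "0:1:2:3" \
    --pool 1          # particles pooled per job; larger values (e.g. 50) batch more work per GPU call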
@Robert McLeod, how many particles did you pool for the run you posted here?
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):
Hi Dari,
I have now created a new issue (#189), which I hope is ok.
Best wishes James
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):
Hi Dari,
I did read the original issue and I thought initially that it was the same. Then I thought it made sense to continue in the same place rather than spreading things everywhere. In hindsight I agree that wasn't a good idea and I'm sorry.
I will post a new issue and attach the files there as you say.
Best wishes James
Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):
Hi @james_krieger! Please read the original issue description next time before hijacking it. Your issue is not related to this one and will certainly make things confusing for future references. Post this report in a new issue and attach the (small sized) output files from the run.
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):
Actually this issue is still there. The run has now produced the following error at the start of iteration 14:
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):
OK, I cloned and installed relion2-beta again to be sure. It's now running fine and has reached iteration 3 with 2 GPUs, so it looks like I had hit an issue that has since been fixed.
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):
I just realised that I'm not necessarily using the latest beta version - I am using relion-devel-lmb with the cmake-3.7 patch applied. I am now installing the latest relion2-beta, which I cloned from git a few days ago and applied the cmake-3.7 patch to as well. It could be that I am reporting an issue that has already been resolved, and in that case I'm sorry.
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):
Also, to help interpret that, I should mention that my system maps GPU IDs inconsistently: if I ask for the GPU with ID 0, nvidia-smi shows that GPU #2 has been activated. The same happens when running gromacs-2016.
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):
For 3 GPUs, I use -np 4 with --j 4 --gpu "0:1:2"; for 2 GPUs, I use -np 3 with --j 4 --gpu "0:1".
In both cases, I am specifying which GPU for all slave ranks and the assignment is showing up correctly in the run.out files.
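To make the rank-to-device mapping explicit, a hypothetical launch line for the 2-GPU case could look as follows (rank 0 is the master and takes no GPU, and the colon-separated list assigns one device per slave rank; the classification options are placeholders, not the exact ones used here):
#!bash
mpirun -np 3 `which relion_refine_mpi` \
    --i Particles/particles.star --o Class2D/run_2gpu \
    --ctf --iter 25 --K 50 --particle_diameter 200 --zero_mask \
    --j 4 --gpu "0:1"   # slave 1 -> device 0, slave 2 -> device 1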
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):
I forgot to mention that I eventually had to kill the job when it had not done anything for 20-30 minutes, and then I continued it from the end of iteration 1.
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):
I have experienced a problem similar to Robert's on my system with 3 Titan X (Maxwell) cards when I run on two of them. RELION stalls at the end of iteration 2 without any error: one of the GPUs holds data in memory but doesn't appear to be doing anything, and run.out stops being written shortly before the expected end of the iteration. I paste some monitoring data (nvidia-smi, sensors and mpstat) followed by the run.out, which also shows that I can continue the run and then complete iteration 2 and start iteration 3. I have not seen this issue when using all three GPUs, but I have also seen the following error in run.out files in both cases (which I paste before the monitoring for the stalling run). It should probably be noted that with all three GPUs I have been able to run a full set of 25 iterations. All these runs use the benchmark dataset.
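The snapshots below were captured at one-minute intervals; a simple collection loop along these lines (an assumption about how the data was gathered, not the exact script used) would produce equivalent output:
#!bash
# take a snapshot of GPU state, CPU temperatures and per-core CPU usage once a minute
while true; do
    date
    nvidia-smi
    sensors
    mpstat -P ALL
    sleep 60
done >> monitor.log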
Thu Jan 19 13:44:51 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.23 Driver Version: 370.23 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:05:00.0 On | N/A |
| 22% 52C P8 17W / 250W | 93MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:06:00.0 Off | N/A |
| 53% 83C P2 144W / 250W | 11495MiB / 12206MiB | 88% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:09:00.0 Off | N/A |
| 38% 79C P2 146W / 250W | 11495MiB / 12206MiB | 77% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15163 G /usr/bin/Xorg 89MiB |
| 1 943 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
| 2 942 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
+-----------------------------------------------------------------------------+
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +44.0°C (high = +87.0°C, crit = +97.0°C)
Core 0: +39.0°C (high = +87.0°C, crit = +97.0°C)
Core 1: +34.0°C (high = +87.0°C, crit = +97.0°C)
Core 2: +36.0°C (high = +87.0°C, crit = +97.0°C)
Core 3: +40.0°C (high = +87.0°C, crit = +97.0°C)
Core 4: +39.0°C (high = +87.0°C, crit = +97.0°C)
Core 5: +44.0°C (high = +87.0°C, crit = +97.0°C)
Core 6: +35.0°C (high = +87.0°C, crit = +97.0°C)
Core 7: +40.0°C (high = +87.0°C, crit = +97.0°C)
Linux 2.6.32-642.13.1.el6.x86_64 (ig-pc-10.lmb.internal) 19/01/17 _x86_64_ (16 CPU)
13:44:51 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
13:44:51 all 12.62 0.00 2.87 0.15 0.00 0.00 0.00 0.00 84.37
13:44:51 0 14.97 0.00 3.83 0.12 0.00 0.00 0.00 0.00 81.07
13:44:51 1 14.93 0.00 3.78 0.05 0.00 0.01 0.00 0.00 81.24
13:44:51 2 15.23 0.00 3.59 0.07 0.00 0.00 0.00 0.00 81.11
13:44:51 3 14.99 0.00 3.42 0.06 0.00 0.00 0.00 0.00 81.52
13:44:51 4 15.47 0.00 3.23 0.02 0.00 0.00 0.00 0.00 81.28
13:44:51 5 15.92 0.00 2.99 0.02 0.00 0.00 0.00 0.00 81.07
13:44:51 6 15.80 0.00 2.81 0.06 0.00 0.00 0.00 0.00 81.33
13:44:51 7 19.94 0.00 2.22 0.66 0.00 0.00 0.00 0.00 77.17
13:44:51 8 11.36 0.00 3.28 0.02 0.00 0.01 0.00 0.00 85.33
13:44:51 9 11.89 0.00 3.26 0.00 0.00 0.00 0.00 0.00 84.84
13:44:51 10 11.10 0.00 3.05 0.00 0.00 0.00 0.00 0.00 85.84
13:44:51 11 10.27 0.00 2.80 0.01 0.00 0.00 0.00 0.00 86.92
13:44:51 12 9.31 0.00 2.46 0.00 0.00 0.00 0.00 0.00 88.23
13:44:51 13 8.16 0.00 2.00 0.00 0.00 0.00 0.00 0.00 89.84
13:44:51 14 7.93 0.00 1.91 1.24 0.00 0.01 0.00 0.00 88.91
13:44:51 15 4.61 0.00 1.25 0.03 0.00 0.00 0.00 0.00 94.10
Thu Jan 19 13:45:51 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.23 Driver Version: 370.23 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:05:00.0 On | N/A |
| 22% 52C P8 17W / 250W | 93MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:06:00.0 Off | N/A |
| 53% 83C P2 141W / 250W | 11495MiB / 12206MiB | 62% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:09:00.0 Off | N/A |
| 38% 79C P2 149W / 250W | 11495MiB / 12206MiB | 77% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15163 G /usr/bin/Xorg 89MiB |
| 1 943 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
| 2 942 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
+-----------------------------------------------------------------------------+
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +46.0°C (high = +87.0°C, crit = +97.0°C)
Core 0: +37.0°C (high = +87.0°C, crit = +97.0°C)
Core 1: +36.0°C (high = +87.0°C, crit = +97.0°C)
Core 2: +38.0°C (high = +87.0°C, crit = +97.0°C)
Core 3: +35.0°C (high = +87.0°C, crit = +97.0°C)
Core 4: +42.0°C (high = +87.0°C, crit = +97.0°C)
Core 5: +36.0°C (high = +87.0°C, crit = +97.0°C)
Core 6: +46.0°C (high = +87.0°C, crit = +97.0°C)
Core 7: +36.0°C (high = +87.0°C, crit = +97.0°C)
Linux 2.6.32-642.13.1.el6.x86_64 (ig-pc-10.lmb.internal) 19/01/17 _x86_64_ (16 CPU)
13:45:51 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
13:45:51 all 12.63 0.00 2.87 0.15 0.00 0.00 0.00 0.00 84.35
13:45:51 0 15.00 0.00 3.84 0.12 0.00 0.00 0.00 0.00 81.04
13:45:51 1 14.95 0.00 3.78 0.05 0.00 0.01 0.00 0.00 81.22
13:45:51 2 15.26 0.00 3.59 0.07 0.00 0.00 0.00 0.00 81.09
13:45:51 3 15.01 0.00 3.42 0.06 0.00 0.00 0.00 0.00 81.50
13:45:51 4 15.49 0.00 3.23 0.02 0.00 0.00 0.00 0.00 81.26
13:45:51 5 15.98 0.00 2.99 0.01 0.00 0.00 0.00 0.00 81.02
13:45:51 6 15.84 0.00 2.81 0.06 0.00 0.00 0.00 0.00 81.29
13:45:51 7 19.97 0.00 2.22 0.66 0.00 0.00 0.00 0.00 77.14
13:45:51 8 11.38 0.00 3.28 0.02 0.00 0.01 0.00 0.00 85.31
13:45:51 9 11.91 0.00 3.26 0.00 0.00 0.00 0.00 0.00 84.83
13:45:51 10 11.12 0.00 3.05 0.00 0.00 0.00 0.00 0.00 85.83
13:45:51 11 10.28 0.00 2.80 0.01 0.00 0.00 0.00 0.00 86.91
13:45:51 12 9.32 0.00 2.46 0.00 0.00 0.00 0.00 0.00 88.22
13:45:51 13 8.16 0.00 2.00 0.00 0.00 0.00 0.00 0.00 89.84
13:45:51 14 7.93 0.00 1.90 1.24 0.00 0.01 0.00 0.00 88.91
13:45:51 15 4.61 0.00 1.25 0.03 0.00 0.00 0.00 0.00 94.10
Thu Jan 19 13:46:51 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.23 Driver Version: 370.23 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:05:00.0 On | N/A |
| 22% 52C P8 17W / 250W | 93MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:06:00.0 Off | N/A |
| 53% 83C P2 143W / 250W | 11495MiB / 12206MiB | 83% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:09:00.0 Off | N/A |
| 38% 78C P2 151W / 250W | 11495MiB / 12206MiB | 77% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15163 G /usr/bin/Xorg 89MiB |
| 1 943 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
| 2 942 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
+-----------------------------------------------------------------------------+
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +48.0°C (high = +87.0°C, crit = +97.0°C)
Core 0: +39.0°C (high = +87.0°C, crit = +97.0°C)
Core 1: +34.0°C (high = +87.0°C, crit = +97.0°C)
Core 2: +42.0°C (high = +87.0°C, crit = +97.0°C)
Core 3: +36.0°C (high = +87.0°C, crit = +97.0°C)
Core 4: +44.0°C (high = +87.0°C, crit = +97.0°C)
Core 5: +37.0°C (high = +87.0°C, crit = +97.0°C)
Core 6: +48.0°C (high = +87.0°C, crit = +97.0°C)
Core 7: +37.0°C (high = +87.0°C, crit = +97.0°C)
Linux 2.6.32-642.13.1.el6.x86_64 (ig-pc-10.lmb.internal) 19/01/17 _x86_64_ (16 CPU)
13:46:51 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
13:46:51 all 12.65 0.00 2.87 0.15 0.00 0.00 0.00 0.00 84.33
13:46:51 0 15.02 0.00 3.84 0.12 0.00 0.00 0.00 0.00 81.02
13:46:51 1 14.97 0.00 3.78 0.05 0.00 0.01 0.00 0.00 81.20
13:46:51 2 15.28 0.00 3.59 0.07 0.00 0.00 0.00 0.00 81.07
13:46:51 3 15.03 0.00 3.43 0.06 0.00 0.00 0.00 0.00 81.48
13:46:51 4 15.51 0.00 3.23 0.02 0.00 0.00 0.00 0.00 81.24
13:46:51 5 16.00 0.00 2.99 0.01 0.00 0.00 0.00 0.00 81.00
13:46:51 6 15.91 0.00 2.81 0.06 0.00 0.00 0.00 0.00 81.22
13:46:51 7 19.99 0.00 2.22 0.66 0.00 0.00 0.00 0.00 77.12
13:46:51 8 11.40 0.00 3.28 0.02 0.00 0.01 0.00 0.00 85.29
13:46:51 9 11.91 0.00 3.26 0.00 0.00 0.00 0.00 0.00 84.82
13:46:51 10 11.13 0.00 3.05 0.00 0.00 0.00 0.00 0.00 85.81
13:46:51 11 10.30 0.00 2.80 0.01 0.00 0.00 0.00 0.00 86.89
13:46:51 12 9.33 0.00 2.46 0.00 0.00 0.00 0.00 0.00 88.21
13:46:51 13 8.16 0.00 2.00 0.00 0.00 0.00 0.00 0.00 89.84
13:46:51 14 7.93 0.00 1.90 1.24 0.00 0.01 0.00 0.00 88.92
13:46:51 15 4.61 0.00 1.25 0.03 0.00 0.00 0.00 0.00 94.10
Thu Jan 19 13:47:51 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.23 Driver Version: 370.23 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:05:00.0 On | N/A |
| 22% 52C P8 17W / 250W | 93MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:06:00.0 Off | N/A |
| 53% 83C P2 138W / 250W | 11495MiB / 12206MiB | 82% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:09:00.0 Off | N/A |
| 37% 70C P2 76W / 250W | 11495MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15163 G /usr/bin/Xorg 89MiB |
| 1 943 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
| 2 942 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
+-----------------------------------------------------------------------------+
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +43.0°C (high = +87.0°C, crit = +97.0°C)
Core 0: +35.0°C (high = +87.0°C, crit = +97.0°C)
Core 1: +35.0°C (high = +87.0°C, crit = +97.0°C)
Core 2: +35.0°C (high = +87.0°C, crit = +97.0°C)
Core 3: +44.0°C (high = +87.0°C, crit = +97.0°C)
Core 4: +39.0°C (high = +87.0°C, crit = +97.0°C)
Core 5: +40.0°C (high = +87.0°C, crit = +97.0°C)
Core 6: +43.0°C (high = +87.0°C, crit = +97.0°C)
Core 7: +37.0°C (high = +87.0°C, crit = +97.0°C)
Linux 2.6.32-642.13.1.el6.x86_64 (ig-pc-10.lmb.internal) 19/01/17 _x86_64_ (16 CPU)
13:47:51 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
13:47:51 all 12.67 0.00 2.87 0.15 0.00 0.00 0.00 0.00 84.31
13:47:51 0 15.04 0.00 3.84 0.12 0.00 0.00 0.00 0.00 81.00
13:47:51 1 14.99 0.00 3.78 0.05 0.00 0.01 0.00 0.00 81.18
13:47:51 2 15.30 0.00 3.59 0.07 0.00 0.00 0.00 0.00 81.04
13:47:51 3 15.04 0.00 3.43 0.06 0.00 0.00 0.00 0.00 81.48
13:47:51 4 15.52 0.00 3.23 0.02 0.00 0.00 0.00 0.00 81.23
13:47:51 5 16.02 0.00 2.99 0.01 0.00 0.00 0.00 0.00 80.97
13:47:51 6 15.95 0.00 2.81 0.06 0.00 0.00 0.00 0.00 81.17
13:47:51 7 20.00 0.00 2.23 0.66 0.00 0.00 0.00 0.00 77.11
13:47:51 8 11.41 0.00 3.28 0.02 0.00 0.01 0.00 0.00 85.28
13:47:51 9 11.92 0.00 3.26 0.00 0.00 0.00 0.00 0.00 84.81
13:47:51 10 11.14 0.00 3.05 0.00 0.00 0.00 0.00 0.00 85.80
13:47:51 11 10.37 0.00 2.80 0.01 0.00 0.00 0.00 0.00 86.83
13:47:51 12 9.33 0.00 2.46 0.00 0.00 0.00 0.00 0.00 88.21
13:47:51 13 8.16 0.00 2.00 0.00 0.00 0.00 0.00 0.00 89.84
13:47:51 14 7.92 0.00 1.90 1.24 0.00 0.01 0.00 0.00 88.93
13:47:51 15 4.61 0.00 1.25 0.03 0.00 0.00 0.00 0.00 94.10
Thu Jan 19 13:48:51 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.23 Driver Version: 370.23 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:05:00.0 On | N/A |
| 22% 53C P8 17W / 250W | 93MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:06:00.0 Off | N/A |
| 52% 82C P2 142W / 250W | 11495MiB / 12206MiB | 80% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:09:00.0 Off | N/A |
| 31% 63C P2 74W / 250W | 11495MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15163 G /usr/bin/Xorg 89MiB |
| 1 943 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
| 2 942 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
+-----------------------------------------------------------------------------+
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +43.0°C (high = +87.0°C, crit = +97.0°C)
Core 0: +35.0°C (high = +87.0°C, crit = +97.0°C)
Core 1: +34.0°C (high = +87.0°C, crit = +97.0°C)
Core 2: +38.0°C (high = +87.0°C, crit = +97.0°C)
Core 3: +43.0°C (high = +87.0°C, crit = +97.0°C)
Core 4: +43.0°C (high = +87.0°C, crit = +97.0°C)
Core 5: +38.0°C (high = +87.0°C, crit = +97.0°C)
Core 6: +38.0°C (high = +87.0°C, crit = +97.0°C)
Core 7: +35.0°C (high = +87.0°C, crit = +97.0°C)
Linux 2.6.32-642.13.1.el6.x86_64 (ig-pc-10.lmb.internal) 19/01/17 _x86_64_ (16 CPU)
13:48:51 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
13:48:51 all 12.68 0.00 2.87 0.15 0.00 0.00 0.00 0.00 84.30
13:48:51 0 15.05 0.00 3.84 0.12 0.00 0.00 0.00 0.00 80.99
13:48:51 1 15.00 0.00 3.78 0.05 0.00 0.01 0.00 0.00 81.16
13:48:51 2 15.32 0.00 3.59 0.07 0.00 0.00 0.00 0.00 81.02
13:48:51 3 15.02 0.00 3.42 0.06 0.00 0.00 0.00 0.00 81.49
13:48:51 4 15.55 0.00 3.24 0.02 0.00 0.00 0.00 0.00 81.20
13:48:51 5 16.02 0.00 2.99 0.01 0.00 0.00 0.00 0.00 80.97
13:48:51 6 15.99 0.00 2.82 0.06 0.00 0.00 0.00 0.00 81.12
13:48:51 7 20.00 0.00 2.23 0.66 0.00 0.00 0.00 0.00 77.11
13:48:51 8 11.40 0.00 3.28 0.02 0.00 0.01 0.00 0.00 85.29
13:48:51 9 11.93 0.00 3.26 0.00 0.00 0.00 0.00 0.00 84.80
13:48:51 10 11.14 0.00 3.05 0.00 0.00 0.00 0.00 0.00 85.80
13:48:51 11 10.45 0.00 2.79 0.01 0.00 0.00 0.00 0.00 86.75
13:48:51 12 9.32 0.00 2.46 0.00 0.00 0.00 0.00 0.00 88.22
13:48:51 13 8.15 0.00 2.00 0.00 0.00 0.00 0.00 0.00 89.85
13:48:51 14 7.91 0.00 1.90 1.24 0.00 0.01 0.00 0.00 88.94
13:48:51 15 4.61 0.00 1.25 0.03 0.00 0.00 0.00 0.00 94.11
Thu Jan 19 13:49:51 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.23 Driver Version: 370.23 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:05:00.0 On | N/A |
| 22% 53C P8 17W / 250W | 93MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:06:00.0 Off | N/A |
| 51% 82C P2 148W / 250W | 11495MiB / 12206MiB | 78% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:09:00.0 Off | N/A |
| 28% 62C P2 74W / 250W | 11495MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15163 G /usr/bin/Xorg 89MiB |
| 1 943 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
| 2 942 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
+-----------------------------------------------------------------------------+
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +46.0°C (high = +87.0°C, crit = +97.0°C)
Core 0: +37.0°C (high = +87.0°C, crit = +97.0°C)
Core 1: +34.0°C (high = +87.0°C, crit = +97.0°C)
Core 2: +45.0°C (high = +87.0°C, crit = +97.0°C)
Core 3: +34.0°C (high = +87.0°C, crit = +97.0°C)
Core 4: +46.0°C (high = +87.0°C, crit = +97.0°C)
Core 5: +36.0°C (high = +87.0°C, crit = +97.0°C)
Core 6: +36.0°C (high = +87.0°C, crit = +97.0°C)
Core 7: +35.0°C (high = +87.0°C, crit = +97.0°C)
Linux 2.6.32-642.13.1.el6.x86_64 (ig-pc-10.lmb.internal) 19/01/17 _x86_64_ (16 CPU)
13:49:51 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
13:49:51 all 12.69 0.00 2.87 0.15 0.00 0.00 0.00 0.00 84.29
13:49:51 0 15.07 0.00 3.84 0.12 0.00 0.00 0.00 0.00 80.97
13:49:51 1 15.01 0.00 3.78 0.05 0.00 0.01 0.00 0.00 81.16
13:49:51 2 15.39 0.00 3.59 0.07 0.00 0.00 0.00 0.00 80.96
13:49:51 3 15.03 0.00 3.42 0.06 0.00 0.00 0.00 0.00 81.49
13:49:51 4 15.61 0.00 3.25 0.02 0.00 0.00 0.00 0.00 81.13
13:49:51 5 16.04 0.00 2.99 0.01 0.00 0.00 0.00 0.00 80.95
13:49:51 6 16.00 0.00 2.82 0.06 0.00 0.00 0.00 0.00 81.12
13:49:51 7 20.00 0.00 2.23 0.66 0.00 0.00 0.00 0.00 77.10
13:49:51 8 11.40 0.00 3.27 0.02 0.00 0.01 0.00 0.00 85.29
13:49:51 9 11.94 0.00 3.26 0.00 0.00 0.00 0.00 0.00 84.80
13:49:51 10 11.13 0.00 3.05 0.00 0.00 0.00 0.00 0.00 85.82
13:49:51 11 10.45 0.00 2.79 0.01 0.00 0.00 0.00 0.00 86.74
13:49:51 12 9.32 0.00 2.46 0.00 0.00 0.00 0.00 0.00 88.23
13:49:51 13 8.14 0.00 2.00 0.00 0.00 0.00 0.00 0.00 89.86
13:49:51 14 7.91 0.00 1.90 1.24 0.00 0.01 0.00 0.00 88.95
13:49:51 15 4.60 0.00 1.25 0.03 0.00 0.00 0.00 0.00 94.11
Thu Jan 19 13:50:51 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.23 Driver Version: 370.23 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:05:00.0 On | N/A |
| 22% 52C P8 17W / 250W | 93MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:06:00.0 Off | N/A |
| 50% 82C P2 137W / 250W | 11495MiB / 12206MiB | 75% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:09:00.0 Off | N/A |
| 27% 62C P2 74W / 250W | 11495MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15163 G /usr/bin/Xorg 89MiB |
| 1 943 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
| 2 942 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
+-----------------------------------------------------------------------------+
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +46.0°C (high = +87.0°C, crit = +97.0°C)
Core 0: +36.0°C (high = +87.0°C, crit = +97.0°C)
Core 1: +33.0°C (high = +87.0°C, crit = +97.0°C)
Core 2: +45.0°C (high = +87.0°C, crit = +97.0°C)
Core 3: +36.0°C (high = +87.0°C, crit = +97.0°C)
Core 4: +46.0°C (high = +87.0°C, crit = +97.0°C)
Core 5: +37.0°C (high = +87.0°C, crit = +97.0°C)
Core 6: +37.0°C (high = +87.0°C, crit = +97.0°C)
Core 7: +37.0°C (high = +87.0°C, crit = +97.0°C)
Linux 2.6.32-642.13.1.el6.x86_64 (ig-pc-10.lmb.internal) 19/01/17 _x86_64_ (16 CPU)
13:50:51 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
13:50:51 all 12.70 0.00 2.87 0.15 0.00 0.00 0.00 0.00 84.28
13:50:51 0 15.07 0.00 3.84 0.12 0.00 0.00 0.00 0.00 80.97
13:50:51 1 15.03 0.00 3.78 0.05 0.00 0.01 0.00 0.00 81.14
13:50:51 2 15.46 0.00 3.59 0.07 0.00 0.00 0.00 0.00 80.89
13:50:51 3 15.05 0.00 3.42 0.06 0.00 0.00 0.00 0.00 81.46
13:50:51 4 15.66 0.00 3.27 0.02 0.00 0.00 0.00 0.00 81.05
13:50:51 5 16.06 0.00 2.99 0.01 0.00 0.00 0.00 0.00 80.93
13:50:51 6 16.01 0.00 2.82 0.06 0.00 0.00 0.00 0.00 81.10
13:50:51 7 20.01 0.00 2.23 0.66 0.00 0.00 0.00 0.00 77.10
13:50:51 8 11.39 0.00 3.27 0.02 0.00 0.01 0.00 0.00 85.30
13:50:51 9 11.93 0.00 3.26 0.00 0.00 0.00 0.00 0.00 84.81
13:50:51 10 11.12 0.00 3.04 0.00 0.00 0.00 0.00 0.00 85.83
13:50:51 11 10.45 0.00 2.79 0.01 0.00 0.00 0.00 0.00 86.75
13:50:51 12 9.31 0.00 2.45 0.00 0.00 0.00 0.00 0.00 88.24
13:50:51 13 8.13 0.00 2.00 0.00 0.00 0.00 0.00 0.00 89.87
13:50:51 14 7.90 0.00 1.90 1.24 0.00 0.01 0.00 0.00 88.95
13:50:51 15 4.60 0.00 1.25 0.03 0.00 0.00 0.00 0.00 94.12
=== RELION MPI setup ===
=== RELION MPI setup ===
Running CPU instructions in double precision.
Expectation iteration 3 of 25 000/??? sec ~~(,,"> [oo] 0.75/39.47 min .~~(,,"> 1.53/44.82 min ..~~(,,"> 2.30/46.55 min ..~~(,,"> 3.08/47.72 min ...~~(,,"> 3.95/49.48 min ....~~(,,"> 4.82/50.68 min .....~~(,,"> 5.72/51.87 min ......~~(,,"> 6.55/52.22 min .......~~(,,"> 7.40/52.62 min ........~~(,,"> 8.23/52.83 min .........~~(,,"> 9.03/52.82 min ..........~~(,,"> 9.83/52.80 min ...........~~(,,"> 10.68/53.03 min ............~~(,,"> 11.52/53.15 min ............~~(,,"> 12.35/53.27 min .............~~(,,"> 13.20/53.42 min ..............~~(,,"> 14.07/53.63 min ...............~~(,,"> 14.93/53.82 min ................~~(,,"> 15.82/54.03 min .................~~(,,"> 16.63/54.02 min ..................~~(,,"> 17.47/54.07 min ...................~~(,,"> 18.30/54.10 min ....................~~(,,"> 19.12/54.08 min .....................~~(,,"> 19.97/54.15 min ......................~~(,,"> 20.77/54.08 min .......................~~(,,"> 21.60/54.12 min .......................~~(,,"> 22.43/54.15 min ........................~~(,,"> 23.27/54.17 min .........................~~(,,"> 24.20/54.42 min ..........................~~(,,"> 25.07/54.50 min ...........................~~(,_,">
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
Seems like Robert's problem is the same as the one we have in issues #54 / #53. The RELION behaviour is exactly the same; only in this case it is not just one GPU stalling but two. Is it a large dataset?
Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):
Hi @robbmcleod,
Considering that you're not getting any error messages, I doubt this is the same issue addressed by the provided fix.
What happens if you run with only 2 GPUs, do you see a similar behaviour? Do you have other GPU systems you can reproduce this on? Could you also try with the currently latest GPU drivers (375.26) and see if that helps?
If the issue doesn't occur on our systems there's unfortunately no way it can be directly addressed.
Original comment by Robert McLeod (Bitbucket: robbmcleod, GitHub: robbmcleod):
Hi,
We have similar issues on some quad Titan Xp nodes. What happens is that two GPUs finish and then 1-2 seem to stall: the elapsed time reaches RELION's estimate for finishing the iteration and then just keeps growing forever. I am struggling, however, with how to get any useful debugging information out of the situation. This job was launched with RELION 2.02 patched with the double-mode fix you supplied above. All CPU threads continue to run at 100%. We had similar troubles with 2.01.
The particles are somewhat elongated by this iteration but still featureless.
I could probably make a tarball and provide a download link. The particles look to be 31 GB. I'll also ask the graduate student to re-extract with tighter particle picking conditions and see if that helps. However the particle picking looks good to my eye.
#!bash
Thu Dec 15 13:26:08 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48 Driver Version: 367.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 0000:08:00.0 Off | N/A |
| 23% 39C P2 70W / 250W | 11290MiB / 12189MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN X (Pascal) Off | 0000:0C:00.0 Off | N/A |
| 23% 23C P8 8W / 250W | 2MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN X (Pascal) Off | 0000:84:00.0 Off | N/A |
| 23% 26C P8 7W / 250W | 2MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN X (Pascal) Off | 0000:88:00.0 Off | N/A |
| 39% 67C P2 86W / 250W | 11290MiB / 12189MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3289 C ...entOS7_SM61/install/bin/relion_refine_mpi 5705MiB |
| 0 3290 C ...entOS7_SM61/install/bin/relion_refine_mpi 5583MiB |
| 3 3295 C ...entOS7_SM61/install/bin/relion_refine_mpi 5705MiB |
| 3 3296 C ...entOS7_SM61/install/bin/relion_refine_mpi 5583MiB |
+-----------------------------------------------------------------------------+
Here is the output of the log. Stderr should appear in it as well:
cat HDL_2Dr1c1.o21679486
--gpu 0,0,0:0,0,0:1,1,1:1,1,1:2,2,2:2,2,2:3,3,3:3,3,3
=== RELION MPI setup ===
+ Number of MPI processes = 9
+ Number of threads per MPI process = 3
+ Total number of threads therefore = 27
+ Master (0) runs on host = sgi25.cluster.bc2.ch
+ Slave 1 runs on host = sgi25.cluster.bc2.ch
+ Slave 2 runs on host = sgi25.cluster.bc2.ch
+ Slave 3 runs on host = sgi25.cluster.bc2.ch
+ Slave 4 runs on host = sgi25.cluster.bc2.ch
+ Slave 5 runs on host = sgi25.cluster.bc2.ch
+ Slave 6 runs on host = sgi25.cluster.bc2.ch
+ Slave 7 runs on host = sgi25.cluster.bc2.ch
+ Slave 8 runs on host = sgi25.cluster.bc2.ch
=================
Running CPU instructions in double precision.
+ On host sgi25.cluster.bc2.ch: free scratch space = 1459 Gb.
Copying particles to scratch directory: /scratch/albste00/relion_volatile/
11.37/11.37 min ............................................................~~(,_,">
uniqueHost sgi25.cluster.bc2.ch has 8 ranks.
Using explicit indexing on slave 0 to assign devices 0 0 0
Thread 0 on slave 1 mapped to device 0
Thread 1 on slave 1 mapped to device 0
Thread 2 on slave 1 mapped to device 0
Using explicit indexing on slave 0 to assign devices 0 0 0
Thread 0 on slave 2 mapped to device 0
Thread 1 on slave 2 mapped to device 0
Thread 2 on slave 2 mapped to device 0
Using explicit indexing on slave 0 to assign devices 1 1 1
Thread 0 on slave 3 mapped to device 1
Thread 1 on slave 3 mapped to device 1
Thread 2 on slave 3 mapped to device 1
Using explicit indexing on slave 0 to assign devices 1 1 1
Thread 0 on slave 4 mapped to device 1
Thread 1 on slave 4 mapped to device 1
Thread 2 on slave 4 mapped to device 1
Using explicit indexing on slave 0 to assign devices 2 2 2
Thread 0 on slave 5 mapped to device 2
Thread 1 on slave 5 mapped to device 2
Thread 2 on slave 5 mapped to device 2
Using explicit indexing on slave 0 to assign devices 2 2 2
Thread 0 on slave 6 mapped to device 2
Thread 1 on slave 6 mapped to device 2
Thread 2 on slave 6 mapped to device 2
Using explicit indexing on slave 0 to assign devices 3 3 3
Thread 0 on slave 7 mapped to device 3
Thread 1 on slave 7 mapped to device 3
Thread 2 on slave 7 mapped to device 3
Using explicit indexing on slave 0 to assign devices 3 3 3
Thread 0 on slave 8 mapped to device 3
Thread 1 on slave 8 mapped to device 3
Thread 2 on slave 8 mapped to device 3
Device 0 on sgi25.cluster.bc2.ch is split between 2 slaves
Device 1 on sgi25.cluster.bc2.ch is split between 2 slaves
Device 2 on sgi25.cluster.bc2.ch is split between 2 slaves
Device 3 on sgi25.cluster.bc2.ch is split between 2 slaves
Estimating accuracies in the orientational assignment ...
1.28/1.28 min ............................................................~~(,_,">
Auto-refine: Estimated accuracy angles= 18.1 degrees; offsets= 6.1 pixels
CurrentResolution= 6.9 Angstroms, which requires orientationSampling of at least 7.05882 degrees for a particle of diameter 110 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 134400
OrientationalSampling= 11.25 NrOrientations= 32
TranslationalSampling= 2 NrTranslations= 21
=============================
Oversampling= 1 NrHiddenVariableSamplingPoints= 4300800
OrientationalSampling= 5.625 NrOrientations= 256
TranslationalSampling= 1 NrTranslations= 84
=============================
Estimated memory for expectation step > 11.4269 Gb.
Estimated memory for maximization step > 0.000615567 Gb.
Expectation iteration 6 of 25
3.85/3.88 hrs ...........................................................~~(,_,">
Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):
It ran for at least 10 iterations, so you should start seeing at least a rough shape of your particle. Have a look at the classes of the last iteration and check whether you see anything. If there's only noise, the refinement is not converging for some reason.
Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):
You'll have to download the attached file and refer to its path. If you download it to the Download directory for instance write:
#!bash
git apply $HOME/Download/double-mode.patch
where $HOME is the path to your home directory.
Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):
Hi @barureddy, did this get resolved? If not please try a recent patch for this particular issue (attached).
To apply it, cd into your local repository and run:
#!bash
git apply double-mode.patch
Remember to update your repository with the latest version first.
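A hypothetical end-to-end sequence, assuming a git checkout built with CMake in a build/ subdirectory and the patch saved somewhere accessible, would be:
#!bash
cd relion                             # local clone of the repository
git pull                              # update to the latest version first
git apply /path/to/double-mode.patch  # apply the attached patch
cd build && cmake .. && make -j 8     # rebuild so the patched code is actually used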
Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):
This is a rare issue that I haven't observed much since the last fix. It seems to occur for odd images where the prior and likelihood functions do not overlap within the single-precision limits. Until we fix this issue, please do try a different setting for the run. One flag that you may want to try is the maxsig flag: in 2D you can set the maximum number of significant coarse weights to, for instance, 500 by providing the additional flag "--maxsig 500".
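For concreteness, a hypothetical 2D-classification command with the flag appended (the other options are placeholders and should match whatever you normally run):
#!bash
mpirun -n 5 `which relion_refine_mpi` \
    --i Particles/particles.star --o Class2D/run_maxsig \
    --ctf --iter 25 --K 100 --particle_diameter 200 --zero_mask \
    --j 4 --gpu "0" \
    --maxsig 500          # consider at most 500 significant coarse weights per particle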
The original issue (stalling execution during the last phase of an expectation step) is fixed in v2.0.4. It resulted from using threads in an inherently unsafe manner within the cuda-context. During fast enough runs, this led to occasional deadlock in cuda-internal pthread-mutex locks. This also explains why decreasing pool helps; it decreases thread-based parallelism and the concurrency which caused the stall.
I have been hitting something which could be this issue while running the RELION benchmarks (2.0.3 / OpenMPI 1.10.6 / CentOS 6) on a single node with 16 cores / 2 sockets and 4x Titan X Pascal: random (non-reproducible) stalled jobs at 100% CPU/GPU, with 1, 2 or 4 GPUs...
Sounds like it is. Updating the code according to the instructions on the landing page should eliminate this.
Hi, I see you suggest using the maxsig flag (to limit the number of orientations against which each image is examined, correct?). Which numbers are reasonable? (I get a 20-fold speed-up using 500, for example, but will it affect the quality of the 2D classification?) I can't find any documentation. I would appreciate it if you could comment on this.
--maxsig 500 (note: no underscore) will force RELION to consider at most 500 orientations in the finer sampling. You still perform an exhaustive (all-orientations) coarse search according to your settings, so maxsig is a fairly safe optimization strategy. The lower you set maxsig, the more confidence you place in the statement "the best orientational fits are found in the true/correct alignment". Note that the 'fit' is something we calculate to estimate orientation, and it may be noisy or have false peaks. However, I have run with maxsig as low as 5 in 2D without trouble, and as low as ~300 in 3D. If results look strange, I'd recommend validation by running again with a higher maxsig, but generally it does not significantly alter the results unless your data is of really poor quality.
Hello @bforsbe ,
I am using Relion 2.1.0 with OpenMPI 3.0.0 and I believe I am having all the symptoms of this problem.
When I run large 2D classifications (800,000 particles, box size 320), I get hang ups on one GPU as described above while the others are inactive.
This problem seems to be stochastic, as I have run a very similar job with the same particles, and it went through 25 iterations.
However, after it happens once and I cancel the job, subsequent jobs seem to recapitulate the problem, and for a few days, many jobs I run with more than 50k or 100k particles get stuck at one of the expectation iterations. There is no error message, and the job hangs up indefinitely.
I am currently re-running a job that just suffered from this problem with the --pool 1 option to see if it helps.
I know you said this problem was fixed as of v2.0.4, but I am still struggling with it. Any advice? Thanks!
Did you try rebooting the computer? Sounds like a memory (RAM) issue.
@bforsbe Thanks for the quick reply.
I will have our system administrator reset our GPU nodes tomorrow to see if it helps, and I'll get back to you. To help my understanding, could you perhaps expand on how a RAM issue could cause this type of problem?
RELION is not completely deterministic in its results, but getting any error completely stochastically rather hints at some hardware-flaw that you hit. It could still be in the algorithm, but a lot of people are using the algorithm, and only you are using your hardware.
That being said, I don't actually know what error you are getting. This is a long and convoluted thread and "as the above" is unfortunately not very descriptive, in part because at least one false observation was given at some point.
@bforsbe My apologies for the lack of clarity. I believe my issue is more similar to #53 and #54 (please let me know how to move my comments if that would help clarity).
It appears in issue #53 that Ozkan Yildiz had the same problem I am having; namely, the 2D classification stalls at the very last part of the expectation step with no error message. One GPU shows 100% utilization (although low wattage) while the others show 0% utilization and no memory used. Ozkan seemed to have solved this problem by using a different dataset, before the discussion diverged into talking about GPU indexing. However, the problem seemed to manifest again for Ozkan (comment at the end of issue #53) and was then solved by omitting the flag --dont_combine_weights_via_disc in RELION 1.4. Then Ozkan commented on December 16, 2016 (issue #54) that the problem had resurfaced and that they resorted to using CPUs to avoid it altogether (comment in issue #54 towards the middle of the page). Finally, Ozkan mentioned that using --pool 1 alleviated the stalling-GPU problem.
In issue #54, @cdk reported that they were having a problem with a job crashing while a GPU kept running. This would lead users to start new jobs and get memory allocation errors. However, @cdk did not follow up with much information regarding the current state of that problem for them.
At the end of #54, @bforsbe stated that "Using v2.0.4 you should no longer find zombies, as the issue was identified and fixed. It was caused by not-so cautious use of htread-parallelism in a single cuda-context, which in turn caused some dead-locks in internal cuda-threads. The zombie was thus a spin-lock in cuda waiting to return an API-call."
However, I am not sure if there is a difference between a zombie still running on the GPU after a job "crashes" and what I am seeing where the job never technically "crashes," but remains stuck at the last part of an expectation step indefinitely with one GPU at 100% and all others going to 0%.
I have re-started our GPU nodes, and re-run the job starting at the iteration immediately before the crash (iteration 3). The job is now on iteration 6 and seems to be operating successfully. I tracked the buff/cache memory over the course of the run using the command free as I was curious about the RAM. After reboot, buff/cache was empty, and over the several iterations it is now close to ~114-120 GB on all the GPU nodes (we have 128 GB of RAM on each GPU node). Thank you for your tip regarding rebooting the machines.
Is there any other information I can provide to you so that we may nail down the problem? I have run many successful 2D and 3D classifications; however, this problem is the only one I get in Relion that doesn't produce an error message and leaves me sort of clueless as to what is happening. Any help is appreciated. Thanks!
I'm pretty certain your issue is not a lockup as fixed in issue #54, just to have that said.
Relion dispatches mpi-slaves and waits for them to return before continuing. Any time a slave gets stuck doing anything, the symptoms are largely the same, which is why many issues look the same. The interesting observation in your case is the 100% gpu util.
Now, that does not mean that the GPUs are working hard, just that the slave got stuck in a state where the gpu was busy. So network dropouts or RAM-swapping are actually quite unlikely. Thinking about it again, the gpu-driver or cuda-runtime are more likely to have dropped out or ended up in a weird state. The first debug for this is, yep, a reboot.
I don't like to say that this is the issue, because it's not something I can fix. But it's looking likely, unless you have further issues. Let me know if that's the case and we'll try to dig deeper.
Thanks for the simply splendid clarification! You're in my good book.
@bforsbe Thanks for the reply! No problem, writing the summary of the other issues actually helped me clarify the issue myself.
Just for completeness, I will post my newest set of experiments that may help confirm the issue. All of these experiments were run across 7 nodes, each with one GPU card (2 1080's and 5 1080 Ti's), using mpi and 8 mpi processes per card (except the first node which has 7 procs per GPU due to mpi master being on this node; I am using SLURM, therefore the first cpu gets assigned to master, and the other seven go to slaves on node 1). The particles.star file contained ~800k particles with a box size of 320 pixels (1.07 A/pix).
To determine if rebooting computers solved the problem, we rebooted all the nodes and ran my 2D classification. The result was that the job ran successfully through 25 iterations.
To determine if having buffer/cache "full" at start of job is causing problems versus some cuda or gpu-driver problem, we re-ran the same 2D classification without doing anything additional to the computers. The buff/cache memory was almost full (~115GB on 128 GB RAM nodes) at the start of the job, and I checked to make sure there were no zombie processes on any node before running the job. The result was that the classification hangs at the end of iteration 4 with all the usual symptoms (2 of the 1080 Ti's stalling at end of expectation with 100% GPU util, 75W/280W, and 10Gb/11Gb GPU memory used. All other GPU's are empty, 0% util, no memory used, etc.).
To further investigate the role of RAM in this problem, we cleared the caches without rebooting the nodes and performed the same classification (again, checking for any sort of zombie processes). The result was that the classification hangs at the end of expectation 2. This time with a different 1080 Ti hanging at 100% util. (Note: I have performed similar experiments, and it is never the same GPU that stalls at the end of the expectation, the problem appears to happen indiscriminately among 1080's and 1080 Ti's among the nodes).
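For completeness, "clearing the caches without rebooting" typically means dropping the Linux page cache, roughly as sketched below (the standard approach, not necessarily the exact command used here; it requires root):
#!bash
sync                                         # flush dirty pages to disk first
echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes
free -h                                      # verify that buff/cache has shrunk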
Does this, in your view, clear RAM of any role in causing the problem and place the blame back on the gpu-driver or cuda-runtime doing something weird at the last part of the expectation step? Any additional thoughts? Thanks!
8 MPI-ranks per GPU is more than I have ever run. For 2D-classification, you CAN of course do that because the memory footprint of 2D-images is so much less than that of 3D-models, but I would think that there's not much performance to get from having that many.
I can try to reproduce by simply crowding a GPU with an insane number of MPIs and see if there's some contention or race-condition we missed across MPIs. My advice in the meantime is to scale back to one or two MPIs per GPU and pack on the threads instead.
For 2D, I tend to just run a single process with plenty of threads on one GPU and let everything be cached in RAM, rather than spreading the job over many MPI ranks.
With those settings, most things fly through pretty darn quick, even on one GPU. It avoids MPI-communication overhead and RAM-caches everything. Gets the job done. Some choose to linger and perfect their 2D-classes, but this is time wasted in my mind, especially for initial cleaning of otherwise good data.
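A minimal sketch of such a single-process run, assuming images pre-read into RAM and threads instead of MPI (the flag values are illustrative, not the exact settings referred to above):
#!bash
# one process, many threads, a single GPU, everything pre-read into RAM -- no MPI overhead
relion_refine \
    --i Particles/particles.star --o Class2D/quick_clean \
    --ctf --iter 25 --K 100 --tau2_fudge 2 \
    --particle_diameter 200 --zero_mask \
    --preread_images --pool 100 --j 16 --gpu 0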
@bforsbe Thanks for the suggestions! I talked to a very experienced colleague yesterday, and he says he usually does 3-4 MPI procs per GPU for 2D with no problems (but he definitely did not like the idea of 8 MPI per GPU!). I will definitely lower the number of MPI procs per GPU in the future and try your other suggestions.
Hi @bforsbe ,
I am wondering if there is a way to include the "-G" (big G) option for nvcc, for CUDA kernel debugging? I am still debugging this issue. It doesn't seem that -DCMAKE_BUILD_TYPE=Debug enables -G for nvcc, unless I'm mistaken...
My current workaround is to use --continue when jobs get stuck on an iteration. This usually pushes them through until they finish, or if they get stuck again at a later iteration, then I do --continue again until all iterations are done.
Read line 5 of Cmake/BuildTypes.cmake.
Of note, you might also have to set -DBUILD_SHARED_LIBS=OFF for gdb to read the symbols OK.
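A hypothetical debug configuration along those lines; passing -G to nvcc via CUDA_NVCC_FLAGS is an assumption about the build system, not something confirmed in this thread:
#!bash
mkdir -p build-debug && cd build-debug
cmake -DCMAKE_BUILD_TYPE=Debug \
      -DBUILD_SHARED_LIBS=OFF \
      -DCUDA_NVCC_FLAGS="-G -g" ..
make -j 8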
Hi guys, I hope you are all doing well. I am new to RELION 3.1 and am facing some issues with 2D classification. I followed the installation instructions and the GUI runs, but when I tried to run the 2D classification it did not work. I went back to the command-line approach, downloaded the benchmark dataset with the STAR file, and tried to run the following command:
mpirun -n XXX `which relion_refine_mpi` --i Particles/shiny_2sets.star --ctf --iter 25 --tau2_fudge 2 --particle_diameter 360 --K 200 --zero_mask --oversampling 1 --psi_step 6 --offset_range 5 --offset_step 2 --norm --scale --random_seed 0 --o class2d
I got the following error (screenshot: https://user-images.githubusercontent.com/71946383/94349869-69d45f80-000e-11eb-9aca-8587195397a3.png):
Please could someone help me with how to run 2D classification in RELION 3.1? I would really appreciate that! @bforsbe
Hi, you cannot use the XXX when you run from the command line. You need to specify an actual number of MPI tasks. XXX-type variables only work when you are using a template script in which the RELION GUI fills in the variables with values from the GUI.
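As a concrete example, replacing the placeholder with an actual MPI count (5 is an arbitrary choice) gives:
#!bash
mpirun -n 5 `which relion_refine_mpi` --i Particles/shiny_2sets.star --ctf --iter 25 \
    --tau2_fudge 2 --particle_diameter 360 --K 200 --zero_mask --oversampling 1 \
    --psi_step 6 --offset_range 5 --offset_step 2 --norm --scale --random_seed 0 --o class2d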
Originally reported by: Bharat Reddy (Bitbucket: barureddy, GitHub: barureddy)
When I run 2D classification using my GPUs, I get an error with RELION stalling before it reaches the 25 iterations. The master thread still runs as a zombie-like process. My issue looks like #24, but I am not sure. The particle files are quite large, but I can upload them to you if needed.