Closed bforsbe closed 7 years ago
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
Could you repeat the same run as above with --pool 1 and check its behaviour?
Original comment by Robert McLeod (Bitbucket: robbmcleod, GitHub: robbmcleod):
Typically 100. These nodes have 512 GB of system memory so a lack of allocatable memory should not be an issue in our case.
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
Our current status to this problem is the following:
The problem appears to be more likely with more pooled particles.
If we set --pool 1, the problem no longer seems to occur and the 2D classification works fine; so far we have never encountered the problem when using --pool 1.
Before, we used values like --pool 50 for 2D classification, and the problem of a stalling GPU seemed to occur much more often.
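For reference, the pooling size is just another relion_refine_mpi flag; a minimal, hypothetical 2D-classification invocation with it set to 1 (star file, output directory, class count and diameter are placeholders, not values from this report) could look like:
#!bash
mpirun -n 5 `which relion_refine_mpi` \
    --i Particles/particles.star --o Class2D/run_pool1 \
    --ctf --iter 25 --K 100 --tau2_fudge 2 \
    --particle_diameter 200 --zero_mask \
    --j 4 --gpu "0:1:2:3" \
    --pool 1          # particles pooled per job; larger values (e.g. 50) batch more work per GPU call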
@Robert McLeod, how many particles did you pool for the run you posted here?
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):
Hi Dari,
I have now created a new issue (#189), which I hope is ok.
Best wishes James
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):
Hi Dari,
I did read the original issue and I thought initially that it was the same. Then I thought it made sense to continue in the same place rather than spreading things everywhere. In hindsight I agree that wasn't a good idea and I'm sorry.
I will post a new issue and attach the files there as you say.
Best wishes James
Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):
Hi @james_krieger! Please read the original issue description next time before hijacking it. Your issue is not related to this one and will certainly make things confusing for future references. Post this report in a new issue and attach the (small sized) output files from the run.
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):
Actually this issue is still there. The run has now produced the following error at the start of iteration 14:
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):
OK, I cloned and installed relion2-beta again to be sure. It's now running fine and has reached iteration 3 with 2 GPUs, so it looks like I had hit an issue that has since been fixed.
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):
I just realised that I'm not necessarily using the latest beta version - I am using relion-devel-lmb with the cmake-3.7 patch applied. I am now installing the latest relion2-beta, which I cloned from git a few days ago and applied the cmake-3.7 patch to as well. It could be that I am reporting an issue that has already been resolved, and in that case I'm sorry.
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):
Also, to help interpret that, I should mention that my system maps GPU IDs inconsistently: if I ask for the GPU with ID 0, nvidia-smi shows that GPU #2 has been activated. The same happens when running gromacs-2016.
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):
For 3 GPUs, I use -np 4 with --j 4 --gpu "0:1:2"; for 2 GPUs, I use -np 3 with --j 4 --gpu "0:1".
In both cases, I am specifying which GPU for all slave ranks and the assignment is showing up correctly in the run.out files.
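To make the rank-to-device mapping explicit, a hypothetical launch line for the 2-GPU case could look as follows (rank 0 is the master and takes no GPU, and the colon-separated list assigns one device per slave rank; the classification options are placeholders, not the exact ones used here):
#!bash
mpirun -np 3 `which relion_refine_mpi` \
    --i Particles/particles.star --o Class2D/run_2gpu \
    --ctf --iter 25 --K 50 --particle_diameter 200 --zero_mask \
    --j 4 --gpu "0:1"   # slave 1 -> device 0, slave 2 -> device 1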
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):
I forgot to mention that I eventually had to kill the job when it had not done anything for 20-30 minutes, and then I continued it from the end of iteration 1.
Original comment by james_krieger (Bitbucket: james_krieger, GitHub: Unknown):
I have experienced a problem similar to Robert's on my system with 3 Titan X (Maxwell) cards when I run on two of them. RELION stalls at the end of iteration 2 without any error: one of the GPUs holds data in memory but doesn't appear to be doing anything, and run.out stops being written shortly before the expected end of the iteration. I paste some monitoring data (nvidia-smi, sensors and mpstat) followed by the run.out, which also shows that I can continue the run and then complete iteration 2 and start iteration 3. I have not seen this issue when using all three GPUs, but I have also seen the following error in run.out files in both cases (which I paste before the monitoring for the stalling run). It should probably be noted that with all three GPUs I have been able to run a full set of 25 iterations. All these runs use the benchmark dataset.
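The snapshots below were captured at one-minute intervals; a simple collection loop along these lines (an assumption about how the data was gathered, not the exact script used) would produce equivalent output:
#!bash
# take a snapshot of GPU state, CPU temperatures and per-core CPU usage once a minute
while true; do
    date
    nvidia-smi
    sensors
    mpstat -P ALL
    sleep 60
done >> monitor.log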
Thu Jan 19 13:44:51 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.23 Driver Version: 370.23 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:05:00.0 On | N/A |
| 22% 52C P8 17W / 250W | 93MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:06:00.0 Off | N/A |
| 53% 83C P2 144W / 250W | 11495MiB / 12206MiB | 88% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:09:00.0 Off | N/A |
| 38% 79C P2 146W / 250W | 11495MiB / 12206MiB | 77% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15163 G /usr/bin/Xorg 89MiB |
| 1 943 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
| 2 942 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
+-----------------------------------------------------------------------------+
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +44.0°C (high = +87.0°C, crit = +97.0°C)
Core 0: +39.0°C (high = +87.0°C, crit = +97.0°C)
Core 1: +34.0°C (high = +87.0°C, crit = +97.0°C)
Core 2: +36.0°C (high = +87.0°C, crit = +97.0°C)
Core 3: +40.0°C (high = +87.0°C, crit = +97.0°C)
Core 4: +39.0°C (high = +87.0°C, crit = +97.0°C)
Core 5: +44.0°C (high = +87.0°C, crit = +97.0°C)
Core 6: +35.0°C (high = +87.0°C, crit = +97.0°C)
Core 7: +40.0°C (high = +87.0°C, crit = +97.0°C)
Linux 2.6.32-642.13.1.el6.x86_64 (ig-pc-10.lmb.internal) 19/01/17 _x86_64_ (16 CPU)
13:44:51 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
13:44:51 all 12.62 0.00 2.87 0.15 0.00 0.00 0.00 0.00 84.37
13:44:51 0 14.97 0.00 3.83 0.12 0.00 0.00 0.00 0.00 81.07
13:44:51 1 14.93 0.00 3.78 0.05 0.00 0.01 0.00 0.00 81.24
13:44:51 2 15.23 0.00 3.59 0.07 0.00 0.00 0.00 0.00 81.11
13:44:51 3 14.99 0.00 3.42 0.06 0.00 0.00 0.00 0.00 81.52
13:44:51 4 15.47 0.00 3.23 0.02 0.00 0.00 0.00 0.00 81.28
13:44:51 5 15.92 0.00 2.99 0.02 0.00 0.00 0.00 0.00 81.07
13:44:51 6 15.80 0.00 2.81 0.06 0.00 0.00 0.00 0.00 81.33
13:44:51 7 19.94 0.00 2.22 0.66 0.00 0.00 0.00 0.00 77.17
13:44:51 8 11.36 0.00 3.28 0.02 0.00 0.01 0.00 0.00 85.33
13:44:51 9 11.89 0.00 3.26 0.00 0.00 0.00 0.00 0.00 84.84
13:44:51 10 11.10 0.00 3.05 0.00 0.00 0.00 0.00 0.00 85.84
13:44:51 11 10.27 0.00 2.80 0.01 0.00 0.00 0.00 0.00 86.92
13:44:51 12 9.31 0.00 2.46 0.00 0.00 0.00 0.00 0.00 88.23
13:44:51 13 8.16 0.00 2.00 0.00 0.00 0.00 0.00 0.00 89.84
13:44:51 14 7.93 0.00 1.91 1.24 0.00 0.01 0.00 0.00 88.91
13:44:51 15 4.61 0.00 1.25 0.03 0.00 0.00 0.00 0.00 94.10
Thu Jan 19 13:45:51 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.23 Driver Version: 370.23 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:05:00.0 On | N/A |
| 22% 52C P8 17W / 250W | 93MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:06:00.0 Off | N/A |
| 53% 83C P2 141W / 250W | 11495MiB / 12206MiB | 62% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:09:00.0 Off | N/A |
| 38% 79C P2 149W / 250W | 11495MiB / 12206MiB | 77% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15163 G /usr/bin/Xorg 89MiB |
| 1 943 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
| 2 942 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
+-----------------------------------------------------------------------------+
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +46.0°C (high = +87.0°C, crit = +97.0°C)
Core 0: +37.0°C (high = +87.0°C, crit = +97.0°C)
Core 1: +36.0°C (high = +87.0°C, crit = +97.0°C)
Core 2: +38.0°C (high = +87.0°C, crit = +97.0°C)
Core 3: +35.0°C (high = +87.0°C, crit = +97.0°C)
Core 4: +42.0°C (high = +87.0°C, crit = +97.0°C)
Core 5: +36.0°C (high = +87.0°C, crit = +97.0°C)
Core 6: +46.0°C (high = +87.0°C, crit = +97.0°C)
Core 7: +36.0°C (high = +87.0°C, crit = +97.0°C)
Linux 2.6.32-642.13.1.el6.x86_64 (ig-pc-10.lmb.internal) 19/01/17 _x86_64_ (16 CPU)
13:45:51 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
13:45:51 all 12.63 0.00 2.87 0.15 0.00 0.00 0.00 0.00 84.35
13:45:51 0 15.00 0.00 3.84 0.12 0.00 0.00 0.00 0.00 81.04
13:45:51 1 14.95 0.00 3.78 0.05 0.00 0.01 0.00 0.00 81.22
13:45:51 2 15.26 0.00 3.59 0.07 0.00 0.00 0.00 0.00 81.09
13:45:51 3 15.01 0.00 3.42 0.06 0.00 0.00 0.00 0.00 81.50
13:45:51 4 15.49 0.00 3.23 0.02 0.00 0.00 0.00 0.00 81.26
13:45:51 5 15.98 0.00 2.99 0.01 0.00 0.00 0.00 0.00 81.02
13:45:51 6 15.84 0.00 2.81 0.06 0.00 0.00 0.00 0.00 81.29
13:45:51 7 19.97 0.00 2.22 0.66 0.00 0.00 0.00 0.00 77.14
13:45:51 8 11.38 0.00 3.28 0.02 0.00 0.01 0.00 0.00 85.31
13:45:51 9 11.91 0.00 3.26 0.00 0.00 0.00 0.00 0.00 84.83
13:45:51 10 11.12 0.00 3.05 0.00 0.00 0.00 0.00 0.00 85.83
13:45:51 11 10.28 0.00 2.80 0.01 0.00 0.00 0.00 0.00 86.91
13:45:51 12 9.32 0.00 2.46 0.00 0.00 0.00 0.00 0.00 88.22
13:45:51 13 8.16 0.00 2.00 0.00 0.00 0.00 0.00 0.00 89.84
13:45:51 14 7.93 0.00 1.90 1.24 0.00 0.01 0.00 0.00 88.91
13:45:51 15 4.61 0.00 1.25 0.03 0.00 0.00 0.00 0.00 94.10
Thu Jan 19 13:46:51 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.23 Driver Version: 370.23 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:05:00.0 On | N/A |
| 22% 52C P8 17W / 250W | 93MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:06:00.0 Off | N/A |
| 53% 83C P2 143W / 250W | 11495MiB / 12206MiB | 83% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:09:00.0 Off | N/A |
| 38% 78C P2 151W / 250W | 11495MiB / 12206MiB | 77% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15163 G /usr/bin/Xorg 89MiB |
| 1 943 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
| 2 942 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
+-----------------------------------------------------------------------------+
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +48.0°C (high = +87.0°C, crit = +97.0°C)
Core 0: +39.0°C (high = +87.0°C, crit = +97.0°C)
Core 1: +34.0°C (high = +87.0°C, crit = +97.0°C)
Core 2: +42.0°C (high = +87.0°C, crit = +97.0°C)
Core 3: +36.0°C (high = +87.0°C, crit = +97.0°C)
Core 4: +44.0°C (high = +87.0°C, crit = +97.0°C)
Core 5: +37.0°C (high = +87.0°C, crit = +97.0°C)
Core 6: +48.0°C (high = +87.0°C, crit = +97.0°C)
Core 7: +37.0°C (high = +87.0°C, crit = +97.0°C)
Linux 2.6.32-642.13.1.el6.x86_64 (ig-pc-10.lmb.internal) 19/01/17 _x86_64_ (16 CPU)
13:46:51 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
13:46:51 all 12.65 0.00 2.87 0.15 0.00 0.00 0.00 0.00 84.33
13:46:51 0 15.02 0.00 3.84 0.12 0.00 0.00 0.00 0.00 81.02
13:46:51 1 14.97 0.00 3.78 0.05 0.00 0.01 0.00 0.00 81.20
13:46:51 2 15.28 0.00 3.59 0.07 0.00 0.00 0.00 0.00 81.07
13:46:51 3 15.03 0.00 3.43 0.06 0.00 0.00 0.00 0.00 81.48
13:46:51 4 15.51 0.00 3.23 0.02 0.00 0.00 0.00 0.00 81.24
13:46:51 5 16.00 0.00 2.99 0.01 0.00 0.00 0.00 0.00 81.00
13:46:51 6 15.91 0.00 2.81 0.06 0.00 0.00 0.00 0.00 81.22
13:46:51 7 19.99 0.00 2.22 0.66 0.00 0.00 0.00 0.00 77.12
13:46:51 8 11.40 0.00 3.28 0.02 0.00 0.01 0.00 0.00 85.29
13:46:51 9 11.91 0.00 3.26 0.00 0.00 0.00 0.00 0.00 84.82
13:46:51 10 11.13 0.00 3.05 0.00 0.00 0.00 0.00 0.00 85.81
13:46:51 11 10.30 0.00 2.80 0.01 0.00 0.00 0.00 0.00 86.89
13:46:51 12 9.33 0.00 2.46 0.00 0.00 0.00 0.00 0.00 88.21
13:46:51 13 8.16 0.00 2.00 0.00 0.00 0.00 0.00 0.00 89.84
13:46:51 14 7.93 0.00 1.90 1.24 0.00 0.01 0.00 0.00 88.92
13:46:51 15 4.61 0.00 1.25 0.03 0.00 0.00 0.00 0.00 94.10
Thu Jan 19 13:47:51 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.23 Driver Version: 370.23 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:05:00.0 On | N/A |
| 22% 52C P8 17W / 250W | 93MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:06:00.0 Off | N/A |
| 53% 83C P2 138W / 250W | 11495MiB / 12206MiB | 82% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:09:00.0 Off | N/A |
| 37% 70C P2 76W / 250W | 11495MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15163 G /usr/bin/Xorg 89MiB |
| 1 943 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
| 2 942 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
+-----------------------------------------------------------------------------+
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +43.0°C (high = +87.0°C, crit = +97.0°C)
Core 0: +35.0°C (high = +87.0°C, crit = +97.0°C)
Core 1: +35.0°C (high = +87.0°C, crit = +97.0°C)
Core 2: +35.0°C (high = +87.0°C, crit = +97.0°C)
Core 3: +44.0°C (high = +87.0°C, crit = +97.0°C)
Core 4: +39.0°C (high = +87.0°C, crit = +97.0°C)
Core 5: +40.0°C (high = +87.0°C, crit = +97.0°C)
Core 6: +43.0°C (high = +87.0°C, crit = +97.0°C)
Core 7: +37.0°C (high = +87.0°C, crit = +97.0°C)
Linux 2.6.32-642.13.1.el6.x86_64 (ig-pc-10.lmb.internal) 19/01/17 _x86_64_ (16 CPU)
13:47:51 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
13:47:51 all 12.67 0.00 2.87 0.15 0.00 0.00 0.00 0.00 84.31
13:47:51 0 15.04 0.00 3.84 0.12 0.00 0.00 0.00 0.00 81.00
13:47:51 1 14.99 0.00 3.78 0.05 0.00 0.01 0.00 0.00 81.18
13:47:51 2 15.30 0.00 3.59 0.07 0.00 0.00 0.00 0.00 81.04
13:47:51 3 15.04 0.00 3.43 0.06 0.00 0.00 0.00 0.00 81.48
13:47:51 4 15.52 0.00 3.23 0.02 0.00 0.00 0.00 0.00 81.23
13:47:51 5 16.02 0.00 2.99 0.01 0.00 0.00 0.00 0.00 80.97
13:47:51 6 15.95 0.00 2.81 0.06 0.00 0.00 0.00 0.00 81.17
13:47:51 7 20.00 0.00 2.23 0.66 0.00 0.00 0.00 0.00 77.11
13:47:51 8 11.41 0.00 3.28 0.02 0.00 0.01 0.00 0.00 85.28
13:47:51 9 11.92 0.00 3.26 0.00 0.00 0.00 0.00 0.00 84.81
13:47:51 10 11.14 0.00 3.05 0.00 0.00 0.00 0.00 0.00 85.80
13:47:51 11 10.37 0.00 2.80 0.01 0.00 0.00 0.00 0.00 86.83
13:47:51 12 9.33 0.00 2.46 0.00 0.00 0.00 0.00 0.00 88.21
13:47:51 13 8.16 0.00 2.00 0.00 0.00 0.00 0.00 0.00 89.84
13:47:51 14 7.92 0.00 1.90 1.24 0.00 0.01 0.00 0.00 88.93
13:47:51 15 4.61 0.00 1.25 0.03 0.00 0.00 0.00 0.00 94.10
Thu Jan 19 13:48:51 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.23 Driver Version: 370.23 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:05:00.0 On | N/A |
| 22% 53C P8 17W / 250W | 93MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:06:00.0 Off | N/A |
| 52% 82C P2 142W / 250W | 11495MiB / 12206MiB | 80% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:09:00.0 Off | N/A |
| 31% 63C P2 74W / 250W | 11495MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15163 G /usr/bin/Xorg 89MiB |
| 1 943 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
| 2 942 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
+-----------------------------------------------------------------------------+
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +43.0°C (high = +87.0°C, crit = +97.0°C)
Core 0: +35.0°C (high = +87.0°C, crit = +97.0°C)
Core 1: +34.0°C (high = +87.0°C, crit = +97.0°C)
Core 2: +38.0°C (high = +87.0°C, crit = +97.0°C)
Core 3: +43.0°C (high = +87.0°C, crit = +97.0°C)
Core 4: +43.0°C (high = +87.0°C, crit = +97.0°C)
Core 5: +38.0°C (high = +87.0°C, crit = +97.0°C)
Core 6: +38.0°C (high = +87.0°C, crit = +97.0°C)
Core 7: +35.0°C (high = +87.0°C, crit = +97.0°C)
Linux 2.6.32-642.13.1.el6.x86_64 (ig-pc-10.lmb.internal) 19/01/17 _x86_64_ (16 CPU)
13:48:51 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
13:48:51 all 12.68 0.00 2.87 0.15 0.00 0.00 0.00 0.00 84.30
13:48:51 0 15.05 0.00 3.84 0.12 0.00 0.00 0.00 0.00 80.99
13:48:51 1 15.00 0.00 3.78 0.05 0.00 0.01 0.00 0.00 81.16
13:48:51 2 15.32 0.00 3.59 0.07 0.00 0.00 0.00 0.00 81.02
13:48:51 3 15.02 0.00 3.42 0.06 0.00 0.00 0.00 0.00 81.49
13:48:51 4 15.55 0.00 3.24 0.02 0.00 0.00 0.00 0.00 81.20
13:48:51 5 16.02 0.00 2.99 0.01 0.00 0.00 0.00 0.00 80.97
13:48:51 6 15.99 0.00 2.82 0.06 0.00 0.00 0.00 0.00 81.12
13:48:51 7 20.00 0.00 2.23 0.66 0.00 0.00 0.00 0.00 77.11
13:48:51 8 11.40 0.00 3.28 0.02 0.00 0.01 0.00 0.00 85.29
13:48:51 9 11.93 0.00 3.26 0.00 0.00 0.00 0.00 0.00 84.80
13:48:51 10 11.14 0.00 3.05 0.00 0.00 0.00 0.00 0.00 85.80
13:48:51 11 10.45 0.00 2.79 0.01 0.00 0.00 0.00 0.00 86.75
13:48:51 12 9.32 0.00 2.46 0.00 0.00 0.00 0.00 0.00 88.22
13:48:51 13 8.15 0.00 2.00 0.00 0.00 0.00 0.00 0.00 89.85
13:48:51 14 7.91 0.00 1.90 1.24 0.00 0.01 0.00 0.00 88.94
13:48:51 15 4.61 0.00 1.25 0.03 0.00 0.00 0.00 0.00 94.11
Thu Jan 19 13:49:51 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.23 Driver Version: 370.23 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:05:00.0 On | N/A |
| 22% 53C P8 17W / 250W | 93MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:06:00.0 Off | N/A |
| 51% 82C P2 148W / 250W | 11495MiB / 12206MiB | 78% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:09:00.0 Off | N/A |
| 28% 62C P2 74W / 250W | 11495MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15163 G /usr/bin/Xorg 89MiB |
| 1 943 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
| 2 942 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
+-----------------------------------------------------------------------------+
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +46.0°C (high = +87.0°C, crit = +97.0°C)
Core 0: +37.0°C (high = +87.0°C, crit = +97.0°C)
Core 1: +34.0°C (high = +87.0°C, crit = +97.0°C)
Core 2: +45.0°C (high = +87.0°C, crit = +97.0°C)
Core 3: +34.0°C (high = +87.0°C, crit = +97.0°C)
Core 4: +46.0°C (high = +87.0°C, crit = +97.0°C)
Core 5: +36.0°C (high = +87.0°C, crit = +97.0°C)
Core 6: +36.0°C (high = +87.0°C, crit = +97.0°C)
Core 7: +35.0°C (high = +87.0°C, crit = +97.0°C)
Linux 2.6.32-642.13.1.el6.x86_64 (ig-pc-10.lmb.internal) 19/01/17 _x86_64_ (16 CPU)
13:49:51 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
13:49:51 all 12.69 0.00 2.87 0.15 0.00 0.00 0.00 0.00 84.29
13:49:51 0 15.07 0.00 3.84 0.12 0.00 0.00 0.00 0.00 80.97
13:49:51 1 15.01 0.00 3.78 0.05 0.00 0.01 0.00 0.00 81.16
13:49:51 2 15.39 0.00 3.59 0.07 0.00 0.00 0.00 0.00 80.96
13:49:51 3 15.03 0.00 3.42 0.06 0.00 0.00 0.00 0.00 81.49
13:49:51 4 15.61 0.00 3.25 0.02 0.00 0.00 0.00 0.00 81.13
13:49:51 5 16.04 0.00 2.99 0.01 0.00 0.00 0.00 0.00 80.95
13:49:51 6 16.00 0.00 2.82 0.06 0.00 0.00 0.00 0.00 81.12
13:49:51 7 20.00 0.00 2.23 0.66 0.00 0.00 0.00 0.00 77.10
13:49:51 8 11.40 0.00 3.27 0.02 0.00 0.01 0.00 0.00 85.29
13:49:51 9 11.94 0.00 3.26 0.00 0.00 0.00 0.00 0.00 84.80
13:49:51 10 11.13 0.00 3.05 0.00 0.00 0.00 0.00 0.00 85.82
13:49:51 11 10.45 0.00 2.79 0.01 0.00 0.00 0.00 0.00 86.74
13:49:51 12 9.32 0.00 2.46 0.00 0.00 0.00 0.00 0.00 88.23
13:49:51 13 8.14 0.00 2.00 0.00 0.00 0.00 0.00 0.00 89.86
13:49:51 14 7.91 0.00 1.90 1.24 0.00 0.01 0.00 0.00 88.95
13:49:51 15 4.60 0.00 1.25 0.03 0.00 0.00 0.00 0.00 94.11
Thu Jan 19 13:50:51 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.23 Driver Version: 370.23 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:05:00.0 On | N/A |
| 22% 52C P8 17W / 250W | 93MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:06:00.0 Off | N/A |
| 50% 82C P2 137W / 250W | 11495MiB / 12206MiB | 75% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:09:00.0 Off | N/A |
| 27% 62C P2 74W / 250W | 11495MiB / 12206MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15163 G /usr/bin/Xorg 89MiB |
| 1 943 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
| 2 942 C ...mb_copy_newPC/build/bin/relion_refine_mpi 11492MiB |
+-----------------------------------------------------------------------------+
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +46.0°C (high = +87.0°C, crit = +97.0°C)
Core 0: +36.0°C (high = +87.0°C, crit = +97.0°C)
Core 1: +33.0°C (high = +87.0°C, crit = +97.0°C)
Core 2: +45.0°C (high = +87.0°C, crit = +97.0°C)
Core 3: +36.0°C (high = +87.0°C, crit = +97.0°C)
Core 4: +46.0°C (high = +87.0°C, crit = +97.0°C)
Core 5: +37.0°C (high = +87.0°C, crit = +97.0°C)
Core 6: +37.0°C (high = +87.0°C, crit = +97.0°C)
Core 7: +37.0°C (high = +87.0°C, crit = +97.0°C)
Linux 2.6.32-642.13.1.el6.x86_64 (ig-pc-10.lmb.internal) 19/01/17 _x86_64_ (16 CPU)
13:50:51 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
13:50:51 all 12.70 0.00 2.87 0.15 0.00 0.00 0.00 0.00 84.28
13:50:51 0 15.07 0.00 3.84 0.12 0.00 0.00 0.00 0.00 80.97
13:50:51 1 15.03 0.00 3.78 0.05 0.00 0.01 0.00 0.00 81.14
13:50:51 2 15.46 0.00 3.59 0.07 0.00 0.00 0.00 0.00 80.89
13:50:51 3 15.05 0.00 3.42 0.06 0.00 0.00 0.00 0.00 81.46
13:50:51 4 15.66 0.00 3.27 0.02 0.00 0.00 0.00 0.00 81.05
13:50:51 5 16.06 0.00 2.99 0.01 0.00 0.00 0.00 0.00 80.93
13:50:51 6 16.01 0.00 2.82 0.06 0.00 0.00 0.00 0.00 81.10
13:50:51 7 20.01 0.00 2.23 0.66 0.00 0.00 0.00 0.00 77.10
13:50:51 8 11.39 0.00 3.27 0.02 0.00 0.01 0.00 0.00 85.30
13:50:51 9 11.93 0.00 3.26 0.00 0.00 0.00 0.00 0.00 84.81
13:50:51 10 11.12 0.00 3.04 0.00 0.00 0.00 0.00 0.00 85.83
13:50:51 11 10.45 0.00 2.79 0.01 0.00 0.00 0.00 0.00 86.75
13:50:51 12 9.31 0.00 2.45 0.00 0.00 0.00 0.00 0.00 88.24
13:50:51 13 8.13 0.00 2.00 0.00 0.00 0.00 0.00 0.00 89.87
13:50:51 14 7.90 0.00 1.90 1.24 0.00 0.01 0.00 0.00 88.95
13:50:51 15 4.60 0.00 1.25 0.03 0.00 0.00 0.00 0.00 94.12
=== RELION MPI setup ===
=== RELION MPI setup ===
Running CPU instructions in double precision.
Expectation iteration 3 of 25 000/??? sec ~~(,,"> [oo] 0.75/39.47 min .~~(,,"> 1.53/44.82 min ..~~(,,"> 2.30/46.55 min ..~~(,,"> 3.08/47.72 min ...~~(,,"> 3.95/49.48 min ....~~(,,"> 4.82/50.68 min .....~~(,,"> 5.72/51.87 min ......~~(,,"> 6.55/52.22 min .......~~(,,"> 7.40/52.62 min ........~~(,,"> 8.23/52.83 min .........~~(,,"> 9.03/52.82 min ..........~~(,,"> 9.83/52.80 min ...........~~(,,"> 10.68/53.03 min ............~~(,,"> 11.52/53.15 min ............~~(,,"> 12.35/53.27 min .............~~(,,"> 13.20/53.42 min ..............~~(,,"> 14.07/53.63 min ...............~~(,,"> 14.93/53.82 min ................~~(,,"> 15.82/54.03 min .................~~(,,"> 16.63/54.02 min ..................~~(,,"> 17.47/54.07 min ...................~~(,,"> 18.30/54.10 min ....................~~(,,"> 19.12/54.08 min .....................~~(,,"> 19.97/54.15 min ......................~~(,,"> 20.77/54.08 min .......................~~(,,"> 21.60/54.12 min .......................~~(,,"> 22.43/54.15 min ........................~~(,,"> 23.27/54.17 min .........................~~(,,"> 24.20/54.42 min ..........................~~(,,"> 25.07/54.50 min ...........................~~(,_,">
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
Seems like Robert's problem is the same as the one we have in issues #54 / #53. The RELION behaviour is exactly the same; only in this case it is not just one GPU stalling but two. Is it a large dataset?
Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):
Hi @robbmcleod,
Considering that you're not getting any error messages, I doubt this is the same issue addressed by the provided fix.
What happens if you run with only 2 GPUs, do you see a similar behaviour? Do you have other GPU systems you can reproduce this on? Could you also try with the currently latest GPU drivers (375.26) and see if that helps?
If the issue doesn't occur on our systems there's unfortunately no way it can be directly addressed.
Original comment by Robert McLeod (Bitbucket: robbmcleod, GitHub: robbmcleod):
Hi,
We have similar issues on some quad Titan Xp nodes. What happens is that two GPUs finish and then 1-2 seem to stall: the elapsed time reaches RELION's estimate for finishing the iteration and then just keeps growing forever. I am struggling, however, with how to get any useful debugging information out of the situation. This job was launched with RELION 2.02 patched with the double-mode fix you supplied above. All CPU threads continue to run at 100%. We had similar troubles with 2.01.
The particles are somewhat elongated by this iteration but still featureless.
I could probably make a tarball and provide a download link. The particles look to be 31 GB. I'll also ask the graduate student to re-extract with tighter particle picking conditions and see if that helps. However the particle picking looks good to my eye.
#!bash
Thu Dec 15 13:26:08 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48 Driver Version: 367.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 0000:08:00.0 Off | N/A |
| 23% 39C P2 70W / 250W | 11290MiB / 12189MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN X (Pascal) Off | 0000:0C:00.0 Off | N/A |
| 23% 23C P8 8W / 250W | 2MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN X (Pascal) Off | 0000:84:00.0 Off | N/A |
| 23% 26C P8 7W / 250W | 2MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN X (Pascal) Off | 0000:88:00.0 Off | N/A |
| 39% 67C P2 86W / 250W | 11290MiB / 12189MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3289 C ...entOS7_SM61/install/bin/relion_refine_mpi 5705MiB |
| 0 3290 C ...entOS7_SM61/install/bin/relion_refine_mpi 5583MiB |
| 3 3295 C ...entOS7_SM61/install/bin/relion_refine_mpi 5705MiB |
| 3 3296 C ...entOS7_SM61/install/bin/relion_refine_mpi 5583MiB |
+-----------------------------------------------------------------------------+
Here is the output of the log. Stderr should appear in it as well:
cat HDL_2Dr1c1.o21679486
--gpu 0,0,0:0,0,0:1,1,1:1,1,1:2,2,2:2,2,2:3,3,3:3,3,3
=== RELION MPI setup ===
+ Number of MPI processes = 9
+ Number of threads per MPI process = 3
+ Total number of threads therefore = 27
+ Master (0) runs on host = sgi25.cluster.bc2.ch
+ Slave 1 runs on host = sgi25.cluster.bc2.ch
+ Slave 2 runs on host = sgi25.cluster.bc2.ch
+ Slave 3 runs on host = sgi25.cluster.bc2.ch
+ Slave 4 runs on host = sgi25.cluster.bc2.ch
+ Slave 5 runs on host = sgi25.cluster.bc2.ch
+ Slave 6 runs on host = sgi25.cluster.bc2.ch
+ Slave 7 runs on host = sgi25.cluster.bc2.ch
+ Slave 8 runs on host = sgi25.cluster.bc2.ch
=================
Running CPU instructions in double precision.
+ On host sgi25.cluster.bc2.ch: free scratch space = 1459 Gb.
Copying particles to scratch directory: /scratch/albste00/relion_volatile/
11.37/11.37 min ............................................................~~(,_,">
uniqueHost sgi25.cluster.bc2.ch has 8 ranks.
Using explicit indexing on slave 0 to assign devices 0 0 0
Thread 0 on slave 1 mapped to device 0
Thread 1 on slave 1 mapped to device 0
Thread 2 on slave 1 mapped to device 0
Using explicit indexing on slave 0 to assign devices 0 0 0
Thread 0 on slave 2 mapped to device 0
Thread 1 on slave 2 mapped to device 0
Thread 2 on slave 2 mapped to device 0
Using explicit indexing on slave 0 to assign devices 1 1 1
Thread 0 on slave 3 mapped to device 1
Thread 1 on slave 3 mapped to device 1
Thread 2 on slave 3 mapped to device 1
Using explicit indexing on slave 0 to assign devices 1 1 1
Thread 0 on slave 4 mapped to device 1
Thread 1 on slave 4 mapped to device 1
Thread 2 on slave 4 mapped to device 1
Using explicit indexing on slave 0 to assign devices 2 2 2
Thread 0 on slave 5 mapped to device 2
Thread 1 on slave 5 mapped to device 2
Thread 2 on slave 5 mapped to device 2
Using explicit indexing on slave 0 to assign devices 2 2 2
Thread 0 on slave 6 mapped to device 2
Thread 1 on slave 6 mapped to device 2
Thread 2 on slave 6 mapped to device 2
Using explicit indexing on slave 0 to assign devices 3 3 3
Thread 0 on slave 7 mapped to device 3
Thread 1 on slave 7 mapped to device 3
Thread 2 on slave 7 mapped to device 3
Using explicit indexing on slave 0 to assign devices 3 3 3
Thread 0 on slave 8 mapped to device 3
Thread 1 on slave 8 mapped to device 3
Thread 2 on slave 8 mapped to device 3
Device 0 on sgi25.cluster.bc2.ch is split between 2 slaves
Device 1 on sgi25.cluster.bc2.ch is split between 2 slaves
Device 2 on sgi25.cluster.bc2.ch is split between 2 slaves
Device 3 on sgi25.cluster.bc2.ch is split between 2 slaves
Estimating accuracies in the orientational assignment ...
1.28/1.28 min ............................................................~~(,_,">
Auto-refine: Estimated accuracy angles= 18.1 degrees; offsets= 6.1 pixels
CurrentResolution= 6.9 Angstroms, which requires orientationSampling of at least 7.05882 degrees for a particle of diameter 110 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 134400
OrientationalSampling= 11.25 NrOrientations= 32
TranslationalSampling= 2 NrTranslations= 21
=============================
Oversampling= 1 NrHiddenVariableSamplingPoints= 4300800
OrientationalSampling= 5.625 NrOrientations= 256
TranslationalSampling= 1 NrTranslations= 84
=============================
Estimated memory for expectation step > 11.4269 Gb.
Estimated memory for maximization step > 0.000615567 Gb.
Expectation iteration 6 of 25
3.85/3.88 hrs ...........................................................~~(,_,">
Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):
It ran for at least 10 iterations, so you should start seeing at least a rough shape of your particle. Have a look at the classes of the last iteration and check whether you see anything. If there's only noise, the refinement is not converging for some reason.
Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):
You'll have to download the attached file and refer to its path. If you download it to the Download directory for instance write:
#!bash
git apply $HOME/Download/double-mode.patch
where $HOME is the path to your home directory.
Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):
Hi @barureddy, did this get resolved? If not please try a recent patch for this particular issue (attached).
To apply it, cd into your local repository and run:
#!bash
git apply double-mode.patch
Remember to update your repository with the latest version first.
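A hypothetical end-to-end sequence, assuming a git checkout built with CMake in a build/ subdirectory and the patch saved somewhere accessible, would be:
#!bash
cd relion                             # local clone of the repository
git pull                              # update to the latest version first
git apply /path/to/double-mode.patch  # apply the attached patch
cd build && cmake .. && make -j 8     # rebuild so the patched code is actually used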
Original comment by Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius):
This is a rare issue that I haven't observed much since the last fix. It seems to occur for odd images where the prior and likelihood functions do not overlap within the single-precision limits. Until we fix this issue, please do try a different setting for the run. One flag that you may want to try is the maxsig flag: in 2D you can set the maximum number of significant coarse weights to, for instance, 500 by providing the additional flag "--maxsig 500".
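For concreteness, a hypothetical 2D-classification command with the flag appended (the other options are placeholders and should match whatever you normally run):
#!bash
mpirun -n 5 `which relion_refine_mpi` \
    --i Particles/particles.star --o Class2D/run_maxsig \
    --ctf --iter 25 --K 100 --particle_diameter 200 --zero_mask \
    --j 4 --gpu "0" \
    --maxsig 500          # consider at most 500 significant coarse weights per particle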
The original issue (stalling execution during the last phase of an expectation step) is fixed in v2.0.4. It resulted from using threads in an inherently unsafe manner within the cuda-context. During fast enough runs, this led to occasional deadlock in cuda-internal pthread-mutex locks. This also explains why decreasing pool helps; it decreases thread-based parallelism and the concurrency which caused the stall.
I have been hitting something which could be this issue while running the RELION benchmarks (2.0.3 / OpenMPI 1.10.6 / CentOS 6) on a single node with 16 cores / 2 sockets and 4x Titan X Pascal: random (non-reproducible) stalled jobs at 100% CPU/GPU, with 1, 2 or 4 GPUs...
Sounds like it is. Updating the code according to the instructions on the landing page should eliminate this.
Hi, I see you suggest using the maxsig flag (to limit the number of orientations against which each image is examined, correct?). Which numbers are reasonable? (I get a 20-fold speed-up using 500, for example, but will it affect the quality of the 2D classification?) I can't find any documentation. I would appreciate it if you could comment on this.
--maxsig 500 (note: no underscore) will force RELION to consider at most 500 orientations in the finer sampling. You still perform an exhaustive (all-orientations) coarse search according to your settings, so maxsig is a fairly safe optimization strategy. The lower you set maxsig, the more confidence you place in the statement "the best orientational fits are found in the true/correct alignment". Note that the 'fit' is something we calculate to estimate orientation, and it may be noisy or have false peaks. However, I have run with maxsig as low as 5 in 2D without trouble, and as low as ~300 in 3D. If results look strange, I'd recommend validation by running again with a higher maxsig, but generally it does not significantly alter the results unless your data is of really poor quality.
Hello @bforsbe ,
I am using Relion 2.1.0 with OpenMPI 3.0.0 and I believe I am having all the symptoms of this problem.
When I run large 2D classifications (800,000 particles, box size 320), I get hang ups on one GPU as described above while the others are inactive.
This problem seems to be stochastic, as I have run a very similar job with the same particles, and it went through 25 iterations.
However, after it happens once and I cancel the job, subsequent jobs seem to recapitulate the problem, and for a few days, many jobs I run with more than 50k or 100k particles get stuck at one of the expectation iterations. There is no error message, and the job hangs up indefinitely.
I am currently re-running a job that just suffered from this problem with the --pool 1 option to see if it helps.
I know you said this problem was fixed as of v2.0.4, but I am still struggling with it. Any advice? Thanks!
Did you try rebooting the computer? Sounds like a memory (RAM) issue.
@bforsbe Thanks for the quick reply.
I will have our system administrator reset our GPU nodes tomorrow to see if it helps, and I'll get back to you. To help my understanding, could you perhaps expand on how a RAM issue could cause this type of problem?
RELION is not completely deterministic in its results, but getting any error completely stochastically rather hints at some hardware-flaw that you hit. It could still be in the algorithm, but a lot of people are using the algorithm, and only you are using your hardware.
That being said, I don't actually know what error you are getting. This is a long and convoluted thread and "as the above" is unfortunately not very descriptive, in part because at least one false observation was given at some point.
@bforsbe My apologies for the lack of clarity. I believe my issue is more similar to #53 and #54 (please let me know how to move my comments if that would help clarity).
It appears in issue #53 that Ozkan Yildiz had the same problem I am having; namely, the 2D classification stalls at the very last part of the expectation step with no error message. One GPU shows 100% utilization (although low wattage) while the others show 0% utilization and no memory used. Ozkan seemed to have solved this problem by using a different dataset, before the discussion diverged into talking about GPU indexing. However, the problem seemed to manifest again for Ozkan (comment at the end of issue #53) and was then solved by omitting the flag --dont_combine_weights_via_disc in RELION 1.4. Then Ozkan commented on December 16, 2016 (issue #54) that the problem had resurfaced and that they resorted to using CPUs to avoid it altogether (comment in issue #54 towards the middle of the page). Finally, Ozkan mentioned that using --pool 1 alleviated the stalling-GPU problem.
In issue #54, @cdk reported that they were having a problem with a job crashing while a GPU kept running. This would lead users to start new jobs and get memory allocation errors. However, @cdk did not follow up with much information regarding the current state of that problem for them.
At the end of #54, @bforsbe stated that "Using v2.0.4 you should no longer find zombies, as the issue was identified and fixed. It was caused by not-so cautious use of htread-parallelism in a single cuda-context, which in turn caused some dead-locks in internal cuda-threads. The zombie was thus a spin-lock in cuda waiting to return an API-call."
However, I am not sure if there is a difference between a zombie still running on the GPU after a job "crashes" and what I am seeing where the job never technically "crashes," but remains stuck at the last part of an expectation step indefinitely with one GPU at 100% and all others going to 0%.
I have re-started our GPU nodes, and re-run the job starting at the iteration immediately before the crash (iteration 3). The job is now on iteration 6 and seems to be operating successfully. I tracked the buff/cache memory over the course of the run using the command free as I was curious about the RAM. After reboot, buff/cache was empty, and over the several iterations it is now close to ~114-120 GB on all the GPU nodes (we have 128 GB of RAM on each GPU node). Thank you for your tip regarding rebooting the machines.
Is there any other information I can provide to you so that we may nail down the problem? I have run many successful 2D and 3D classifications; however, this problem is the only one I get in Relion that doesn't produce an error message and leaves me sort of clueless as to what is happening. Any help is appreciated. Thanks!
I'm pretty certain your issue is not a lockup as fixed in issue #54, just to have that said.
Relion dispatches mpi-slaves and waits for them to return before continuing. Any time a slave gets stuck doing anything, the symptoms are largely the same, which is why many issues look the same. The interesting observation in your case is the 100% gpu util.
Now, that does not mean that the GPUs are working hard, just that the slave got stuck in a state where the gpu was busy. So network dropouts or RAM-swapping are actually quite unlikely. Thinking about it again, the gpu-driver or cuda-runtime are more likely to have dropped out or ended up in a weird state. The first debug for this is, yep, a reboot.
I don't like to say that this is the issue, because it's not something I can fix. But it's looking likely, unless you have further issues. Let me know if that's the case and we'll try to dig deeper.
Thanks for the simply splendid clarification! You're in my good book.
@bforsbe Thanks for the reply! No problem, writing the summary of the other issues actually helped me clarify the issue myself.
Just for completeness, I will post my newest set of experiments that may help confirm the issue. All of these experiments were run across 7 nodes, each with one GPU card (2 1080's and 5 1080 Ti's), using mpi and 8 mpi processes per card (except the first node which has 7 procs per GPU due to mpi master being on this node; I am using SLURM, therefore the first cpu gets assigned to master, and the other seven go to slaves on node 1). The particles.star file contained ~800k particles with a box size of 320 pixels (1.07 A/pix).
To determine if rebooting computers solved the problem, we rebooted all the nodes and ran my 2D classification. The result was that the job ran successfully through 25 iterations.
To determine if having buffer/cache "full" at start of job is causing problems versus some cuda or gpu-driver problem, we re-ran the same 2D classification without doing anything additional to the computers. The buff/cache memory was almost full (~115GB on 128 GB RAM nodes) at the start of the job, and I checked to make sure there were no zombie processes on any node before running the job. The result was that the classification hangs at the end of iteration 4 with all the usual symptoms (2 of the 1080 Ti's stalling at end of expectation with 100% GPU util, 75W/280W, and 10Gb/11Gb GPU memory used. All other GPU's are empty, 0% util, no memory used, etc.).
To further investigate the role of RAM in this problem, we cleared the caches without rebooting the nodes and performed the same classification (again, checking for any sort of zombie processes). The result was that the classification hangs at the end of expectation 2. This time with a different 1080 Ti hanging at 100% util. (Note: I have performed similar experiments, and it is never the same GPU that stalls at the end of the expectation, the problem appears to happen indiscriminately among 1080's and 1080 Ti's among the nodes).
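For completeness, "clearing the caches without rebooting" typically means dropping the Linux page cache, roughly as sketched below (the standard approach, not necessarily the exact command used here; it requires root):
#!bash
sync                                         # flush dirty pages to disk first
echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes
free -h                                      # verify that buff/cache has shrunk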
Does this, in your view, clear RAM of any role in causing the problem and place the blame back on the gpu-driver or cuda-runtime doing something weird at the last part of the expectation step? Any additional thoughts? Thanks!
8 MPI-ranks per GPU is more than I have ever run. For 2D-classification, you CAN of course do that because the memory footprint of 2D-images is so much less than that of 3D-models, but I would think that there's not much performance to get from having that many.
I can try to reproduce by simply crowding a GPU with an insane number of MPIs and see if there's some contention or race-condition we missed across MPIs. My advice in the meantime is to scale back to one or two MPIs per GPU and pack on the threads instead.
For 2D, I tend to just run a single process with plenty of threads on one GPU and let everything be cached in RAM, rather than spreading the job over many MPI ranks.
With those settings, most things fly through pretty darn quick, even on one GPU. It avoids MPI-communication overhead and RAM-caches everything. Gets the job done. Some choose to linger and perfect their 2D-classes, but this is time wasted in my mind, especially for initial cleaning of otherwise good data.
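A minimal sketch of such a single-process run, assuming images pre-read into RAM and threads instead of MPI (the flag values are illustrative, not the exact settings referred to above):
#!bash
# one process, many threads, a single GPU, everything pre-read into RAM -- no MPI overhead
relion_refine \
    --i Particles/particles.star --o Class2D/quick_clean \
    --ctf --iter 25 --K 100 --tau2_fudge 2 \
    --particle_diameter 200 --zero_mask \
    --preread_images --pool 100 --j 16 --gpu 0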
@bforsbe Thanks for the suggestions! I talked to a very experienced colleague yesterday, and he says he usually does 3-4 MPI procs per GPU for 2D with no problems (but he definitely did not like the idea of 8 MPI per GPU!). I will definitely lower the number of MPI procs per GPU in the future and try your other suggestions.
Hi @bforsbe ,
I am wondering if there is a way to include the "-G" (big G) option for nvcc, for CUDA kernel debugging? I am still debugging this issue. It doesn't seem that -DCMAKE_BUILD_TYPE=Debug enables -G for nvcc, unless I'm mistaken...
My current workaround is to use --continue when jobs get stuck on an iteration. This usually pushes them through until they finish, or if they get stuck again at a later iteration, then I do --continue again until all iterations are done.
Read line 5 of Cmake/BuildTypes.cmake.
Of note, you might also have to set -DBUILD_SHARED_LIBS=OFF for gdb to read the symbols OK.
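A hypothetical debug configuration along those lines; passing -G to nvcc via CUDA_NVCC_FLAGS is an assumption about the build system, not something confirmed in this thread:
#!bash
mkdir -p build-debug && cd build-debug
cmake -DCMAKE_BUILD_TYPE=Debug \
      -DBUILD_SHARED_LIBS=OFF \
      -DCUDA_NVCC_FLAGS="-G -g" ..
make -j 8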
Hi guys, I hope you are all doing well. I am new to RELION 3.1 and am facing some issues with 2D classification. I followed the installation instructions and the GUI runs, but when I tried to run the 2D classification it did not work. I went back to the command-line approach, downloaded the benchmark dataset with the STAR file, and tried to run the following command:
mpirun -n XXX `which relion_refine_mpi` --i Particles/shiny_2sets.star --ctf --iter 25 --tau2_fudge 2 --particle_diameter 360 --K 200 --zero_mask --oversampling 1 --psi_step 6 --offset_range 5 --offset_step 2 --norm --scale --random_seed 0 --o class2d
I got the following error (screenshot: https://user-images.githubusercontent.com/71946383/94349869-69d45f80-000e-11eb-9aca-8587195397a3.png):
Please could someone help me with how to run 2D classification in RELION 3.1? I would really appreciate that! @bforsbe
Hi, you cannot use the XXX when you run from the command line. You need to specify an actual number of MPI tasks. XXX-type variables only work when you are using a template script in which the RELION GUI fills in the variables with values from the GUI.
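As a concrete example, replacing the placeholder with an actual MPI count (5 is an arbitrary choice) gives:
#!bash
mpirun -n 5 `which relion_refine_mpi` --i Particles/shiny_2sets.star --ctf --iter 25 \
    --tau2_fudge 2 --particle_diameter 360 --K 200 --zero_mask --oversampling 1 \
    --psi_step 6 --offset_range 5 --offset_step 2 --norm --scale --random_seed 0 --o class2d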
Originally reported by: Bharat Reddy (Bitbucket: barureddy, GitHub: barureddy)
When I run 2D classification using my GPUs, I get an error with RELION stalling before it reaches the 25 iterations. The master thread still runs as a zombie-like process. My issue looks like #24, but I am not sure. The particle files are quite large, but I can upload them to you if needed.