Closed: bforsbe closed this issue 7 years ago
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
We have now also observed the same problem, with hang-ups at the end of the expectation iteration, using Relion 1.4 (no GPU usage, of course) when we added the flag --dont_combine_weights_via_disc, which is the default in Relion 2.0. It did not matter whether it was 2D classification, 3D classification or refinement. When the hang-ups appeared, they were reproducible on repeating the same run. We omitted the flag again and everything worked fine afterwards, with no hang-ups.
It may therefore be necessary to check whether this flag really suits a given system. We never used it with Relion 1.4 and have had no problems so far.
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
The next version will implement the following rule: If no GPU id(s) is given, automatic mapping is performed by relion.
If GPU-id(s) are supplied for exactly one rank, this selection is extended to apply to all ranks.
If GPU-id(s) are supplied for more than one rank, then this selection is used for those ranks, and automatic mapping is performed for any additional ranks.
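The three rules above can be sketched as a toy shell function. This is purely illustrative: the function name, the round-robin fallback and the output format are my own assumptions, not RELION code.

```bash
# Hypothetical sketch of the rank->GPU mapping rule described above --
# NOT RELION's actual implementation. Arguments: the colon-delimited
# --gpu string, the number of working ranks, the number of visible
# devices. Prints one mapping line per rank.
map_gpus() {
  local gpu_arg="$1" n_ranks="$2" n_devices="$3"
  local -a per_rank
  if [ -z "$gpu_arg" ]; then
    # Rule 1: no ids given -> automatic mapping (here: simple round-robin)
    for ((r = 0; r < n_ranks; r++)); do
      echo "rank $r -> device $((r % n_devices)) (auto)"
    done
    return
  fi
  IFS=':' read -ra per_rank <<< "$gpu_arg"
  if [ "${#per_rank[@]}" -eq 1 ]; then
    # Rule 2: ids for exactly one rank -> extend the selection to all ranks
    for ((r = 0; r < n_ranks; r++)); do
      echo "rank $r -> device(s) ${per_rank[0]}"
    done
    return
  fi
  # Rule 3: explicit ids for the listed ranks, automatic mapping for the rest
  for ((r = 0; r < n_ranks; r++)); do
    if [ "$r" -lt "${#per_rank[@]}" ]; then
      echo "rank $r -> device(s) ${per_rank[r]}"
    else
      echo "rank $r -> device $((r % n_devices)) (auto)"
    fi
  done
}
```

For example, `map_gpus "0:1" 3 2` would give rank 0 device 0, rank 1 device 1, and rank 2 an automatically mapped device.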
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
At the moment, no, they don't all go to zero. The default is that if you use ANY syntax, it is taken as an explicit and informed selection; if you don't specify the GPUs for each rank individually, the unspecified ranks will use all of them. In short: at the moment you need to say "use only device 0" for all ranks, not just once;
on a 2-GPU machine:
mpirun -n 3 --gpu 0 -> uses GPU 0 for the first rank, and GPU 1 for the second rank (by the logic of "use all if nothing is specified")
mpirun -n 3 --gpu 0:0 -> uses only GPU 0 for both ranks.
Original comment by Sjors Scheres (Bitbucket: scheres, GitHub: scheres):
Not sure what that last remark means. I think the selection rule is pretty good. What does --gpu 0 do in case of multiple MPIs: do they all go on zero (which is what I would expect)? If you don't give enough semi-colons, I guess the expected behaviour is to distribute evenly over the ones provided?
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
I'll include this in the next update.
The --gpu indexing behaviour you observe is entirely expected. If you provide no indexing, Relion will use everything it can find, i.e. all visible GPUs.
--gpu 0 means use device 0 for the first slave (working mpi rank). It says nothing about the second slave.
--gpu 0,0 means distribute the first slave over devices 0 and 0, but still says nothing about the second slave.
--gpu 0:0 uses a colon to delimit ranks, so that the same restriction then applies to the second slave as well as the first.
It might be more intuitive to interpret a selection-string without colons to apply to all slaves. @scheres ?
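The comma semantics described above (a comma-separated list within one rank spreads that rank's threads over the listed devices) can be illustrated with a toy shell function. This is hypothetical: the round-robin thread distribution is an assumption based on the description, not RELION's source.

```bash
# Hypothetical sketch -- not RELION code -- of the comma syntax: within
# one rank, a comma-separated device list means the rank's --j threads
# are distributed (round-robin, by assumption) over the listed devices.
map_threads() {
  local devices="$1" n_threads="$2"
  local -a devs
  IFS=',' read -ra devs <<< "$devices"
  for ((t = 0; t < n_threads; t++)); do
    echo "thread $t -> device ${devs[t % ${#devs[@]}]}"
  done
}
```

Under this sketch, `map_threads "0,1" 4` would put threads 0 and 2 on device 0 and threads 1 and 3 on device 1, while `map_threads "0,0" 2` puts both threads on device 0.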
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
The fix did work:
no crash anymore.
But if I use the syntax --gpu 0,0, then both GPUs are still used; the same with --gpu 0.
To get only one GPU to work, one has to use the syntax --gpu 0:0.
Thanks
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
I forgot to mention:
after reinstalling and updating OpenMPI and OFED, I built and installed Relion from scratch. Ö.
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
Hi Bjoern, I have now uninstalled OpenMPI (1.8.1), which we had previously installed manually.
I also updated the Mellanox OFED software; installing the OFED software also installs
OpenMPI 1.10.3rc4.
mpirun -n 2 --gpu --j 2 works.
mpirun -n 2 --gpu 0,0 --j 2 also works (only on one of the GPUs),
the same with --mca orte_base_help_aggregate 0.
But for n > 2 AND --gpu 0,0 I still get the following (also with --mca orte_base_help_aggregate 0):
Ö.
oeyildiz@x36b /ptmp/oeyildiz/relion_test $ > mpirun --mca orte_base_help_aggregate 0 -n 3 relion_refine_mpi --o Class2D/job002/run --i Import/job001/particles-8000-1410.star --dont_combine_weights_via_disc --pool 50 --ctf --iter 25 --tau2_fudge 2 --particle_diameter 180 --K 30 --flatten_solvent --zero_mask --oversampling 1 --psi_step 10 --offset_range 5 --offset_step 2 --norm --scale --gpu 0,0 --j 2
[1469107917.819943] [x36b:147940:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 1500.12
[1469107917.829929] [x36b:147942:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 1500.12
[1469107917.830117] [x36b:147941:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 1500.12
=== RELION MPI setup ===
+ Number of MPI processes = 3
+ Number of threads per MPI process = 2
+ Total number of threads therefore = 6
+ Master (0) runs on host = x36b
+ Slave 1 runs on host = x36b
+ Slave 2 runs on host = x36b
=================
Running CPU instructions in double precision.
+ WARNING: Changing psi sampling rate (before oversampling) to 5.625 degrees, for more efficient GPU calculations
Estimating initial noise spectra
12/ 12 sec ............................................................~~(,_,">
uniqueHost x36b has 2 ranks.
Using explicit indexing on slave 0 to assign devices 0 0
Thread 0 on slave 1 mapped to device 0
Thread 1 on slave 1 mapped to device 0
Slave 2 will distribute threads over devices [x36b:147940:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
2 0x000000000006749c mxm_handle_error() /var/tmp/OFED_topdir/BUILD/mxm-3.5.3092/src/mxm/util/debug/debug.c:641
3 0x00000000000679ec mxm_error_signal_handler() /var/tmp/OFED_topdir/BUILD/mxm-3.5.3092/src/mxm/util/debug/debug.c:616
4 0x0000000000035670 killpg() ??:0
5 0x0000000000095580 _ZN9__gnu_cxx18stdio_sync_filebufIwSt11char_traitsIwEE8overflowEj() ??:0
6 0x000000000025573c _ZN14MlOptimiserMpi10initialiseEv() /usr/local/relion2-beta/src/ml_optimiser_mpi.cpp:289
7 0x00000000004163f9 main() /usr/local/relion2-beta/src/apps/refine_mpi.cpp:43
8 0x0000000000021b15 __libc_start_main() ??:0
9 0x0000000000416279 _start() ??:0
===================
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 147940 on node x36b exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
oeyildiz@x36b /ptmp/oeyildiz/relion_test $ >
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
It's fairly strange that you get a segfault during initialization, which hints that it is unable to use the device-selection syntax. Then again, the segfault is reported before device assignment in your most recent output (--gpu 0,0). But given that it goes through this segment of the code fine when you don't specify indices, it must be the device assignment that goes wrong.
You could try to build a debug version and run that; it will be MUCH slower, but may give us a bit more information about where it encounters the error:
#!bash
mkdir build_debug
cd build_debug
cmake -DCMAKE_BUILD_TYPE=Debug ..
make -j 8
Then, use
#!bash
mpirun -n 2 --gpu --j 2
to try to catch as much error output as possible without ranks cutting each other off or intermixing output. If you could attach a log of the output you see when/if this fails in the same way then there should be good markers for what your issue is.
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
Yes, it is the latest beta:
root@x36b /usr/local/relion2-beta ## > git pull
Password for 'https://oeyildiz@bitbucket.org':
Already up-to-date.
root@x36b /usr/local/relion2-beta ## >
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
Unfortunately, it does not work with --gpu 0,0.
[x36b:132846] *** Process received signal ***
[x36b:132846] Signal: Segmentation fault (11)
[x36b:132846] Signal code: Address not mapped (1)
[x36b:132846] Failing at address: 0x21
uniqueHost x36b has 16 ranks.
Slave 1 will distribute threads over devices 0 0
Thread 0 on slave 1 mapped to device 0
[x36b:132846] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x7fed4c93a100]
[x36b:132846] [ 1] /usr/local/relion_2.0beta/lib/librelion_lib.so(_ZN14MlOptimiserMpi10initialiseEv+0x109e)[0x7fed53d7baee]
[x36b:132846] [ 2] relion_refine_mpi(main+0x5f)[0x40a17f]
[x36b:132846] [ 3] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fed4c58ab15]
[x36b:132846] [ 4] relion_refine_mpi[0x40a301]
[x36b:132846] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 132846 on node x36b exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Could it be a hyperthreading issue that causes this assignment?
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
As a quick test, try --gpu 0,0. This will activate a different device-selection syntax, which spreads threads across all listed devices, i.e. you should still use only GPU 0. If this works, that helps me nail down where the issue is.
Also, by default each --j thread should map to its own CPU, as long as there are enough of them. What setting(s) is causing them to be assigned to the same CPU?
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
We tried the same 2D classification with another dataset and it works fine. So it is hopefully the scale-correction.
If we only use a single GPU as you suggest with --gpu 0 like
mpirun -n 17 --gpu 0 --j 1
it will result in this error message
--------------------------------------------------------------------------
uniqueHost x36b has 17 ranks.
Using explicit indexing on slave 0 to assign devices 0
Thread 0 on slave 1 mapped to device 0
[x36b:10029] *** Process received signal ***
[x36b:10029] Signal: Segmentation fault (11)
[x36b:10029] Signal code: Address not mapped (1)
[x36b:10029] Failing at address: 0x61
[x36b:10029] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x7f2380ce9100]
[x36b:10029] [ 1]
/usr/local/relion_2.0beta/lib/librelion_lib.so(_ZN14MlOptimiserMpi10initialiseEv+0x109e)[0x7f238812aaee]
[x36b:10029] [ 2] relion_refine_mpi(main+0x5f)[0x40a17f]
[x36b:10029] [ 3]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f2380939b15]
[x36b:10029] [ 4] relion_refine_mpi[0x40a301]
[x36b:10029] *** End of error message ***
[x36b:10027] 15 more processes have sent help message help-mpi-runtime.txt
/ mpi_init:warn-fork
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 10029 on node x36b exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
With our current settings, if we use mpirun -n 5 and --j 4, the threads end up sharing CPUs, each getting on average only 25 % (one fourth) of a core's performance, which makes it somewhat slower again. For 2D classification we therefore did better with more mpirun slaves and --j 1.
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
I can't see anything remarkable, but some micrographs do have a wonky scale-correction (see e.g. line 3743 in run_it006_model.star), which makes the calculations unbalanced and possibly slow and volatile. You could try re-running it without the --scale flag to test this hypothesis. --firstiter_cc might also help, but this is just a guess based on the fact that cc tends to stabilize calculations.
I should remark that we never designed relion to use this many ranks on the same GPU. The fact that this is needed to get the best possible performance is rather an issue which should be fixed. Running that many ranks makes memory considerations more important, and having 16 separate processes share the same piece of limited memory becomes very volatile without very sophisticated ways of managing it. Needless to say, we would rather spend time making sure that you do not need 16 ranks than making sure 16 ranks can share a single GPU. How much faster is
#!bash
mpirun -n 17 --gpu 0 --j 1
compared to e.g.
#!bash
mpirun -n 5 --gpu 0 --j 4
I would expect them to differ, but using 30 classes, they should not be vastly different.
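As a sanity check on this comparison: rank 0 is the master and does no expectation work (the setup output earlier in the thread lists "Master (0)" separately from the slaves), so both invocations drive the GPU with the same number of compute threads. A trivial sketch (hypothetical helper, not part of RELION):

```bash
# Compute threads doing actual work: (MPI processes - 1 master) * threads
# per process. Hypothetical helper for illustration only.
compute_threads() {
  local n_mpi="$1" j="$2"
  echo $(( (n_mpi - 1) * j ))
}
```

Both `compute_threads 17 1` and `compute_threads 5 4` give 16, which is why the two runs should not be vastly different in throughput.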
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
I am working on getting the star files.
We recently also observed the same problem with 3D classification using local searches with a subset of the same dataset we used for the 2D classification above. The particles are not rescaled and the problem appears also using the same particles extracted with different box sizes.
Since we only use 2 GPUs, could it also be a general problem when only a single (remaining) GPU is running? When we, for example, tried to put 17 MPI slaves on one single GPU (mpirun -n 17 ... --gpu 0), it crashed with segfaults. I actually think that both GPUs should terminate at the same time, but the problem might be that one goes down much earlier than the other, because it usually happens only at the end of an iteration, or at least when the iteration was expected to end.
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
@oeyildiz Could you post all star-files for it006? We've seen similar things before and in that case it resulted from the scale-correction going wonky for certain images/micrographs, resulting in very similar weights and unexpected amounts of effort and memory needed, which could definitely cause issues when running 17 MPI-slaves per GPU.
Also, the "memory usage" reported by nvidia-smi is different if you use dmon. If a block of memory is allocated but not currently used, dmon will probably not show that as being used. I think what dmon reports is memory transaction intensity or so.
Thanks for reporting, this is valuable feedback!
Originally reported by: Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius)
Reported by Özkan Yildiz in Issue #24, moved here due to being a different issue:
We have now repeated the same 2D classification with v2.0.b10, after doing it with v2.0.b9, where we observed hang-ups of the GPU cards at the end of its 7th iteration. The processes stop at exactly the same iteration 7 and at the same time, and one GPU keeps running at 100 % while the other goes to 0 %.

We therefore observed the behaviour of the two GPU cards during this iteration. It seems that, before the event of losing one GPU, the memory consumption (and 3 % does not seem to be much), which was equal on both of our 2 GPU cards in previous iterations, gets shifted during iteration 7 onto only one GPU, to double the amount (see below). After a while, the second, zero-memory-consuming GPU goes down (see below), after idling for some time at 100 %. So it looks like one GPU card took over the whole work of the second card.

The iteration with the hang-up takes longer than it should in theory, and the output hangs just before the mouse gets to its end. We noticed that after around 30 min (assuming this would be the time needed for the whole iteration) the time for the expectation starts rising to about 50 min, and then one of the cards does not consume memory anymore. Maybe there is something wrong with the parallelisation?

The temperatures of both GPU cards seem to be OK and there is no sign that one GPU card has a temperature problem. We are doing 3D classifications on the same GPU cards and they work perfectly fine. We are also doing the same 2D classification on CPUs only and it is running fine. Now we are going to try the same 2D classification on only one GPU card (with half the number of CPUs) to see if the same thing happens.

Usual output of nvidia-smi dmon during iteration 6
output of nvidia-smi dmon when the iteration 7 hangs and one GPU goes down
Commandline
Output of iterations 6 and 7
It stops at this point and no error message is shown.