Closed: bforsbe closed this issue 7 years ago
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
We have now also observed the same problem, with hang-ups at the end of the expectation iteration, using Relion 1.4 (no GPU usage, of course) when we added the flag --dont_combine_weights_via_disc, which is the default in Relion 2.0. It did not matter whether it was 2D classification, 3D classification or refinement. When the hang-ups appeared, they were reproducible on repeating the same run. We omitted the flag again and everything worked fine afterwards, with no hang-ups.
It may therefore be necessary to check whether this flag really suits a given system. We never used it with Relion 1.4 and have had no problems so far.
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
The next version will implement the following rule: If no GPU id(s) is given, automatic mapping is performed by relion.
If GPU-id(s) are supplied for exactly one rank, this selection is extended to apply to all ranks.
If GPU-id(s) are supplied for more than one rank, then this selection is used for those ranks, and automatic mapping is performed for any additional ranks.
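The three rules above can be sketched as a toy shell function. This is purely illustrative: the function name, the round-robin fallback and the output format are my own assumptions, not RELION code.

```bash
# Hypothetical sketch of the rank->GPU mapping rule described above --
# NOT RELION's actual implementation. Arguments: the colon-delimited
# --gpu string, the number of working ranks, the number of visible
# devices. Prints one mapping line per rank.
map_gpus() {
  local gpu_arg="$1" n_ranks="$2" n_devices="$3"
  local -a per_rank
  if [ -z "$gpu_arg" ]; then
    # Rule 1: no ids given -> automatic mapping (here: simple round-robin)
    for ((r = 0; r < n_ranks; r++)); do
      echo "rank $r -> device $((r % n_devices)) (auto)"
    done
    return
  fi
  IFS=':' read -ra per_rank <<< "$gpu_arg"
  if [ "${#per_rank[@]}" -eq 1 ]; then
    # Rule 2: ids for exactly one rank -> extend the selection to all ranks
    for ((r = 0; r < n_ranks; r++)); do
      echo "rank $r -> device(s) ${per_rank[0]}"
    done
    return
  fi
  # Rule 3: explicit ids for the listed ranks, automatic mapping for the rest
  for ((r = 0; r < n_ranks; r++)); do
    if [ "$r" -lt "${#per_rank[@]}" ]; then
      echo "rank $r -> device(s) ${per_rank[r]}"
    else
      echo "rank $r -> device $((r % n_devices)) (auto)"
    fi
  done
}
```

For example, `map_gpus "0:1" 3 2` would give rank 0 device 0, rank 1 device 1, and rank 2 an automatically mapped device.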
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
At the moment, no, they don't all go to zero. The default is that if you use ANY syntax, it is taken as an explicit and informed selection; if you don't specify the GPUs for each rank individually, the unspecified ranks will use all of them. In short: at the moment you need to say "use only device 0" for all ranks, not just once;
on a 2-GPU machine:
mpirun -n 3 --gpu 0 -> uses GPU 0 for the first rank, and GPU 1 for the second rank (by the logic of "use all if nothing is specified")
mpirun -n 3 --gpu 0:0 -> uses only GPU 0 for both ranks.
Original comment by Sjors Scheres (Bitbucket: scheres, GitHub: scheres):
Not sure what that last remark means. I think the selection rule is pretty good. What does --gpu 0 do in case of multiple MPIs: do they all go on zero (which is what I would expect)? If you don't give enough semi-colons, I guess the expected behaviour is to distribute evenly over the ones provided?
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
I'll include this in the next update.
The --gpu indexing behaviour you observe is entirely expected. If you provide no indexing, Relion will use everything it can find, i.e. all visible GPUs.
--gpu 0 means use device 0 for the first slave (working mpi rank). It says nothing about the second slave.
--gpu 0,0 means distribute the first slave over devices 0 and 0, but still says nothing about the second slave.
--gpu 0:0 uses a colon to delimit ranks, so that the same restriction then applies to the second slave as well as the first.
It might be more intuitive to interpret a selection-string without colons to apply to all slaves. @scheres ?
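The comma semantics described above (a comma-separated list within one rank spreads that rank's threads over the listed devices) can be illustrated with a toy shell function. This is hypothetical: the round-robin thread distribution is an assumption based on the description, not RELION's source.

```bash
# Hypothetical sketch -- not RELION code -- of the comma syntax: within
# one rank, a comma-separated device list means the rank's --j threads
# are distributed (round-robin, by assumption) over the listed devices.
map_threads() {
  local devices="$1" n_threads="$2"
  local -a devs
  IFS=',' read -ra devs <<< "$devices"
  for ((t = 0; t < n_threads; t++)); do
    echo "thread $t -> device ${devs[t % ${#devs[@]}]}"
  done
}
```

Under this sketch, `map_threads "0,1" 4` would put threads 0 and 2 on device 0 and threads 1 and 3 on device 1, while `map_threads "0,0" 2` puts both threads on device 0.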
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
The fix did work:
no crash anymore.
But if I use the syntax --gpu 0,0, then both GPUs are still used; the same with --gpu 0.
To get only one GPU to work, one has to use the syntax --gpu 0:0.
Thanks
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
I forgot to mention:
after reinstalling and updating OpenMPI and OFED, I built and installed Relion from scratch. Ö.
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
Hi Bjoern, I have now uninstalled OpenMPI (1.8.1), which we had previously installed manually.
I also updated the Mellanox OFED software; installing the OFED software also installs
OpenMPI 1.10.3rc4.
mpirun -n 2 --gpu --j 2 works.
mpirun -n 2 --gpu 0,0 --j 2 also works (only on one of the GPUs),
the same with --mca orte_base_help_aggregate 0.
But for n > 2 AND --gpu 0,0 I still get the following (also with --mca orte_base_help_aggregate 0):
Ö.
oeyildiz@x36b /ptmp/oeyildiz/relion_test $ > mpirun --mca orte_base_help_aggregate 0 -n 3 relion_refine_mpi --o Class2D/job002/run --i Import/job001/particles-8000-1410.star --dont_combine_weights_via_disc --pool 50 --ctf --iter 25 --tau2_fudge 2 --particle_diameter 180 --K 30 --flatten_solvent --zero_mask --oversampling 1 --psi_step 10 --offset_range 5 --offset_step 2 --norm --scale --gpu 0,0 --j 2
[1469107917.819943] [x36b:147940:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 1500.12
[1469107917.829929] [x36b:147942:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 1500.12
[1469107917.830117] [x36b:147941:0] sys.c:744 MXM WARN Conflicting CPU frequencies detected, using: 1500.12
=== RELION MPI setup ===
+ Number of MPI processes = 3
+ Number of threads per MPI process = 2
+ Total number of threads therefore = 6
+ Master (0) runs on host = x36b
+ Slave 1 runs on host = x36b
+ Slave 2 runs on host = x36b
=================
Running CPU instructions in double precision.
+ WARNING: Changing psi sampling rate (before oversampling) to 5.625 degrees, for more efficient GPU calculations
Estimating initial noise spectra
12/ 12 sec ............................................................~~(,_,">
uniqueHost x36b has 2 ranks.
Using explicit indexing on slave 0 to assign devices 0 0
Thread 0 on slave 1 mapped to device 0
Thread 1 on slave 1 mapped to device 0
Slave 2 will distribute threads over devices [x36b:147940:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
2 0x000000000006749c mxm_handle_error() /var/tmp/OFED_topdir/BUILD/mxm-3.5.3092/src/mxm/util/debug/debug.c:641
3 0x00000000000679ec mxm_error_signal_handler() /var/tmp/OFED_topdir/BUILD/mxm-3.5.3092/src/mxm/util/debug/debug.c:616
4 0x0000000000035670 killpg() ??:0
5 0x0000000000095580 _ZN9__gnu_cxx18stdio_sync_filebufIwSt11char_traitsIwEE8overflowEj() ??:0
6 0x000000000025573c _ZN14MlOptimiserMpi10initialiseEv() /usr/local/relion2-beta/src/ml_optimiser_mpi.cpp:289
7 0x00000000004163f9 main() /usr/local/relion2-beta/src/apps/refine_mpi.cpp:43
8 0x0000000000021b15 __libc_start_main() ??:0
9 0x0000000000416279 _start() ??:0
===================
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 147940 on node x36b exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
oeyildiz@x36b /ptmp/oeyildiz/relion_test $ >
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
It's fairly strange that you get a segfault during initialization, which hints that it is unable to use the device-selection syntax. Then again, the segfault is reported before device assignment in your most recent output (--gpu 0,0). But given that it goes through this segment of the code fine when you don't specify indices, it must be the device assignment that goes wrong.
You could try to build a debug version and run that; it will be MUCH slower, but may give us a bit more information about where it encounters the error:
#!bash
mkdir build_debug
cd build_debug
cmake -DCMAKE_BUILD_TYPE=Debug ..
make -j 8
Then, use
#!bash
mpirun -n 2 --gpu --j 2
to try to catch as much error output as possible without ranks cutting each other off or intermixing output. If you could attach a log of the output you see when/if this fails in the same way then there should be good markers for what your issue is.
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
Yes, it is the latest beta:
root@x36b /usr/local/relion2-beta ## > git pull
Password for 'https://oeyildiz@bitbucket.org':
Already up-to-date.
root@x36b /usr/local/relion2-beta ## >
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
Unfortunately, it does not work with --gpu 0,0.
[x36b:132846] *** Process received signal ***
[x36b:132846] Signal: Segmentation fault (11)
[x36b:132846] Signal code: Address not mapped (1)
[x36b:132846] Failing at address: 0x21
uniqueHost x36b has 16 ranks.
Slave 1 will distribute threads over devices 0 0
Thread 0 on slave 1 mapped to device 0
[x36b:132846] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x7fed4c93a100]
[x36b:132846] [ 1] /usr/local/relion_2.0beta/lib/librelion_lib.so(_ZN14MlOptimiserMpi10initialiseEv+0x109e)[0x7fed53d7baee]
[x36b:132846] [ 2] relion_refine_mpi(main+0x5f)[0x40a17f]
[x36b:132846] [ 3] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fed4c58ab15]
[x36b:132846] [ 4] relion_refine_mpi[0x40a301]
[x36b:132846] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 132846 on node x36b exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Could it be a hyperthreading issue that causes this assignment?
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
As a quick test, try --gpu 0,0. This will activate a different device-selection syntax, which spreads threads across all listed devices, i.e. you should still use only GPU 0. If this works, that helps me nail down where the issue is.
Also, by default each --j thread should map to its own CPU, as long as there are enough of them. What setting(s) is causing them to be assigned to the same CPU?
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
We tried the same 2D classification with another dataset and it works fine. So it is hopefully the scale-correction.
If we only use a single GPU as you suggest with --gpu 0 like
mpirun -n 17 --gpu 0 --j 1
it will result in this error message
--------------------------------------------------------------------------
uniqueHost x36b has 17 ranks.
Using explicit indexing on slave 0 to assign devices 0
Thread 0 on slave 1 mapped to device 0
[x36b:10029] *** Process received signal ***
[x36b:10029] Signal: Segmentation fault (11)
[x36b:10029] Signal code: Address not mapped (1)
[x36b:10029] Failing at address: 0x61
[x36b:10029] [ 0] /usr/lib64/libpthread.so.0(+0xf100)[0x7f2380ce9100]
[x36b:10029] [ 1]
/usr/local/relion_2.0beta/lib/librelion_lib.so(_ZN14MlOptimiserMpi10initialiseEv+0x109e)[0x7f238812aaee]
[x36b:10029] [ 2] relion_refine_mpi(main+0x5f)[0x40a17f]
[x36b:10029] [ 3]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f2380939b15]
[x36b:10029] [ 4] relion_refine_mpi[0x40a301]
[x36b:10029] *** End of error message ***
[x36b:10027] 15 more processes have sent help message help-mpi-runtime.txt
/ mpi_init:warn-fork
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 10029 on node x36b exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
With our current settings, if we use mpirun -n 5 and --j 4, the threads end up sharing CPUs, each getting on average only 25 % (one fourth) of a core's performance, which makes it somewhat slower again. For 2D classification we therefore did better with more mpirun slaves and --j 1.
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
I can't see anything remarkable, but some micrographs do have a wonky scale-correction (see e.g. line 3743 in run_it006_model.star), which makes the calculations unbalanced and possibly slow and volatile. You could try re-running it without the --scale flag to test this hypothesis. --firstiter_cc might also help, but this is just a guess based on the fact that cc tends to stabilize calculations.
I should remark that we never designed relion to use this many ranks on the same GPU. The fact that this is needed to get the best possible performance is rather an issue which should be fixed. Running that many ranks makes memory considerations more important, and having 16 separate processes share the same piece of limited memory becomes very volatile without very sophisticated ways of managing it. Needless to say, we would rather spend time making sure that you do not need 16 ranks than making sure 16 ranks can share a single GPU. How much faster is
#!bash
mpirun -n 17 --gpu 0 --j 1
compared to e.g.
#!bash
mpirun -n 5 --gpu 0 --j 4
I would expect them to differ, but using 30 classes, they should not be vastly different.
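As a sanity check on this comparison: rank 0 is the master and does no expectation work (the setup output earlier in the thread lists "Master (0)" separately from the slaves), so both invocations drive the GPU with the same number of compute threads. A trivial sketch (hypothetical helper, not part of RELION):

```bash
# Compute threads doing actual work: (MPI processes - 1 master) * threads
# per process. Hypothetical helper for illustration only.
compute_threads() {
  local n_mpi="$1" j="$2"
  echo $(( (n_mpi - 1) * j ))
}
```

Both `compute_threads 17 1` and `compute_threads 5 4` give 16, which is why the two runs should not be vastly different in throughput.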
Original comment by Özkan Yildiz (Bitbucket: oeyildiz, GitHub: Unknown):
I am working on getting the star files.
We recently also observed the same problem with 3D classification using local searches with a subset of the same dataset we used for the 2D classification above. The particles are not rescaled and the problem appears also using the same particles extracted with different box sizes.
Since we only use 2 GPUs, could it also be a general problem when only a single (remaining) GPU is running? When we, for example, tried to put 17 MPI slaves on one single GPU (mpirun -n 17 ... --gpu 0), it crashed with segfaults. I actually think that both GPUs should terminate at the same time, but the problem might be that one goes down much earlier than the other, because it usually happens only at the end of an iteration, or at least when the iteration was expected to end.
Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):
@oeyildiz Could you post all star-files for it006? We've seen similar things before and in that case it resulted from the scale-correction going wonky for certain images/micrographs, resulting in very similar weights and unexpected amounts of effort and memory needed, which could definitely cause issues when running 17 MPI-slaves per GPU.
Also, the "memory usage" reported by nvidia-smi is different if you use dmon. If a block of memory is allocated but not currently used, dmon will probably not show that as being used. I think what dmon reports is memory transaction intensity or so.
Thanks for reporting, this is valuable feedback!
Originally reported by: Dari Kimanius (Bitbucket: dkimanius, GitHub: dkimanius)
Reported by Özkan Yildiz in Issue #24, moved here due to being a different issue:
We have now repeated the same 2D classification with v2.0.b10, after doing it with v2.0.b9, where we observed hang-ups of the GPU cards at the end of its 7th iteration. The processes stop at exactly the same iteration 7 and at the same time, and one GPU keeps running at 100 % while the other goes to 0 %.

We therefore observed the behaviour of the two GPU cards during this iteration. It seems that, before the event of losing one GPU, the memory consumption (and 3 % does not seem to be much), which was equal on both of our 2 GPU cards in previous iterations, gets shifted during iteration 7 onto only one GPU, to double the amount (see below). After a while, the second, zero-memory-consuming GPU goes down (see below), after idling for some time at 100 %. So it looks like one GPU card took over the whole work of the second card.

The iteration with the hang-up takes longer than it should in theory, and the output hangs just before the mouse gets to its end. We noticed that after around 30 min (assuming this would be the time needed for the whole iteration) the time for the expectation starts rising to about 50 min, and then one of the cards does not consume memory anymore. Maybe there is something wrong with the parallelisation?

The temperatures of both GPU cards seem to be OK and there is no sign that one GPU card has a temperature problem. We are doing 3D classifications on the same GPU cards and they work perfectly fine. We are also doing the same 2D classification on CPUs only and it is running fine. Now we are going to try the same 2D classification on only one GPU card (with half the number of CPUs) to see if the same thing happens.

Usual output of nvidia-smi dmon during iteration 6
output of nvidia-smi dmon when the iteration 7 hangs and one GPU goes down
Commandline
Output of iterations 6 and 7
It stops at this point and no error message is shown.