boutproject / BOUT-dev

BOUT++: Plasma fluid finite-difference simulation code in curvilinear coordinate systems
http://boutproject.github.io/
GNU Lesser General Public License v3.0

Test-suite extremely slow #762

Closed dschwoerer closed 6 years ago

dschwoerer commented 6 years ago

The test suite can be extremely slow. Example

The reason is that MPICH, and also OpenMPI, perform extremely badly when oversubscribed. Should/could we reduce the number of parallel tests?
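One way to do that: cap the number of concurrently running tests at the number of available cores. A minimal Python sketch, assuming the runner starts each test's runtest script itself (the names here are illustrative, not the current test_suite code):

import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Limit concurrent tests to the cores actually available to this process
# (sched_getaffinity is Linux-only); halved because each test typically
# launches at least two MPI processes of its own.
ncores = len(os.sched_getaffinity(0))

def run_test(test_dir):
    return subprocess.run(["./runtest"], cwd=test_dir).returncode

test_dirs = ["test-io", "test-laplace"]  # illustrative
with ThreadPoolExecutor(max_workers=max(1, ncores // 2)) as pool:
    failed = [rc for rc in pool.map(run_test, test_dirs) if rc != 0]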

On a parallel machine the whole thing takes less than 5 minutes, but on the busy server in the example it took about 3 hours.

PS: Sorry for pressing enter too early

d7919 commented 6 years ago

This can be true for the tests with OpenMP as well. I've been experimenting with Travis to find out why/when the OpenMP cases take too long.

Running the test-delp2 case, which runs on 1, 2 and 4 processors, with and without OpenMP, is perhaps interesting. I modified the output to write the real/user/sys time for each of the six cases run in test-delp2 (the three processor counts for each of two mesh sizes).

The pure MPI case shows a substantial jump in the real time taken but no real change in the user time. The hybrid case with two threads shows a very large jump in both the real and the user time when using more than one MPI process. Overall, the test-delp2 case took 5 s without OpenMP and 5.5 minutes with OpenMP.
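(For reference, this kind of per-case real/user/sys split can be captured from a Python runner around each subprocess; a sketch, Unix-only because of the resource module, with the executable name a placeholder:)

import resource
import subprocess
import time

def timed_run(cmd):
    # Diff getrusage(RUSAGE_CHILDREN) around the child so that only this
    # command's CPU time is counted; comparing real against user/sys then
    # shows whether extra wall time went into spinning or plain waiting.
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True)
    real = time.perf_counter() - t0
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    print("real %6.3fs  user %6.3fs  sys %6.3fs" % (
        real,
        after.ru_utime - before.ru_utime,
        after.ru_stime - before.ru_stime))

timed_run(["mpirun", "-np", "2", "./delp2"])  # hypothetical test command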

d7919 commented 6 years ago

It seems that yielding when idle can help a bit, but it is still not really good enough.
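(For concreteness, with OpenMPI the yielding can be requested through the mpi_yield_when_idle MCA parameter; a sketch of doing that from a Python runner, where the executable name is just a placeholder:)

import os
import subprocess

# OMPI_MCA_* environment variables set OpenMPI MCA parameters; this one
# makes idle ranks yield the CPU instead of busy-polling for messages.
env = dict(os.environ, OMPI_MCA_mpi_yield_when_idle="1")
subprocess.run(["mpirun", "-np", "4", "./test_executable"], env=env, check=True)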

dschwoerer commented 6 years ago

I ran the blob2d.py example with nout=1 on a local virtual machine.

1 MPI process, 2 cores
real    0m1.621s
user    0m1.515s
sys     0m0.163s

2 MPI processes, 2 cores
real    0m1.092s
user    0m1.922s
sys     0m0.175s

4 MPI processes, 2 cores
real    0m9.845s
user    0m17.644s
sys     0m1.871s

1 MPI process, 1 core
real    0m1.627s
user    0m1.486s
sys     0m0.083s

2 MPI processes, 1 core
real    0m9.097s
user    0m8.063s
sys     0m0.971s

4 MPI processes, 1 core
real    0m33.167s
user    0m30.068s
sys     0m2.985s

I didn't recompile with OpenMP enabled, but nevertheless I never saw a case where real was much larger than user ...

I thought Travis gives 2 cores to each build, but maybe they do some smarter load balancing?

Is --mca an OpenMPI-only thing? MPICH complains about the unrecognised argument mca.

d7919 commented 6 years ago

Yes, --mca is OpenMPI-only, unfortunately. From what I've read, the busy polling is baked into the default "nemesis" communication channel used by MPICH, and you have to rebuild/configure MPICH to enable a different channel that doesn't force this.
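(Since the flag is implementation-specific, a test runner could detect the MPI flavour before adding it; a rough Python sketch keyed on mpirun --version output, whose exact wording is an assumption on my part:)

import subprocess

def mpi_flavour():
    # OpenMPI's mpirun --version mentions "Open MPI"; MPICH's mentions
    # "HYDRA" (its process manager) or "MPICH".
    out = subprocess.run(["mpirun", "--version"],
                         capture_output=True, text=True)
    banner = out.stdout + out.stderr
    if "Open MPI" in banner:
        return "openmpi"
    if "HYDRA" in banner or "MPICH" in banner:
        return "mpich"
    return "unknown"

# Only pass --mca when mpirun will understand it.
yield_args = (["--mca", "mpi_yield_when_idle", "1"]
              if mpi_flavour() == "openmpi" else [])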

Travis has two different types of build system, container-based and VM-based. The container-based version gives you exactly two cores and boots faster, but unfortunately seems to end up being slower overall, so we currently use the VM-based approach, which gives "~2" cores in burst mode (i.e. not allocated exclusive usage of two cores, I believe).

This branch uses the container-based approach (it just requires removing one line from the .yml file) but is quite a bit slower than the VM approach.

d7919 commented 6 years ago

(If I recall correctly, when I looked at this previously the slowdown in the container was related to the use of the AUFS file system, which apparently has some issues.)

ZedThree commented 6 years ago

It might help to move what we can into unit tests; then we at least don't have to pay the cost of setting up/tearing down full physics models each time. I have a basic example of MPI unit tests in the mpi_tests branch.
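The branch itself uses the C++ unit-test framework; purely as an illustration of the idea (this mpi4py sketch is mine, not code from the branch), assertions run directly under MPI with no physics-model set-up or tear-down per test:

# Run with: mpirun -np 2 python test_mpi_sum.py
from mpi4py import MPI

def test_allreduce_sum():
    comm = MPI.COMM_WORLD
    # Each rank contributes its rank number; over n ranks the sum is n(n-1)/2.
    total = comm.allreduce(comm.Get_rank(), op=MPI.SUM)
    assert total == comm.Get_size() * (comm.Get_size() - 1) // 2

if __name__ == "__main__":
    test_allreduce_sum()
    if MPI.COMM_WORLD.Get_rank() == 0:
        print("ok")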

dschwoerer commented 6 years ago

Or rewrite tests using the Python interface. The old test, with C++ and starting/stopping BOUT++:

. 6062.005 s - interpolation

The new test with the Python interface comes down to around 240 ms ... To be fair, the new test is without MPI; the GlobalField could be useful for that, but I haven't done it yet ...

The above timings are from mock; outside mock the time only goes from 7.6 s to 230 ms ...
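To illustrate why the in-process style is so much faster: a check like interpolation order becomes plain array arithmetic, with no executable start-up, MPI launch, or file I/O per case. A pure-numpy sketch (not the actual test):

import numpy as np

# Verify second-order convergence of midpoint (linear) interpolation.
errors = []
for n in (16, 32, 64):
    x = np.linspace(0.0, 1.0, n + 1)
    f = np.sin(2 * np.pi * x)
    mid = 0.5 * (f[:-1] + f[1:])                       # interpolate to midpoints
    exact = np.sin(2 * np.pi * 0.5 * (x[:-1] + x[1:]))
    errors.append(np.max(np.abs(mid - exact)))

orders = np.log2(np.array(errors[:-1]) / np.array(errors[1:]))
assert np.all(orders > 1.9), orders                    # expect roughly 2nd order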

An overview of where we can improve:

======= Making integrated tests ========
Making 21 tests
.....................

======= All make tests passed in 33.73 seconds =======
+ ./test_suite
======= Starting integrated tests ========
Running 24 tests
. 121.223 s - test-io
.   0.270 s - test-command-args
. 121.035 s - test-fieldfactory
. 201.116 s - test-restarting
. 123.258 s - test-laplace
. 482.710 s - test-cyclic
. 362.147 s - test-invpar
. 161.132 s - test-smooth
.  80.504 s - test-region-iterator
. 121.062 s - test-gyro
. 274.479 s - test-delp2
.  40.682 s - test-vec
.  40.401 s - test-griddata
. 120.601 s - test-dataiterator
. 120.678 s - test-dataiterator2
.   0.596 s - test-fieldgroup
. 166.693 s - test-initial
. 160.936 s - test-stopCheck
.  40.324 s - test-subdir
.   0.319 s - test-aiolos
.  93.473 s - test-fci-slab
. 124.938 s - test-drift-instability
.  91.766 s - test-interchange-instability
.   0.026 s - test-code-style

======= All tests passed in 3050.40 seconds =======
+ popd
~/build/BUILD/BOUT-dev-95745b97af4687d51de9400ace7537fc71b23608
+ pushd build_mpich/tests/MMS
~/build/BUILD/BOUT-dev-95745b97af4687d51de9400ace7537fc71b23608/build_mpich/tests/MMS ~/build/BUILD/BOUT-dev-95745b97af4687d51de9400ace7537fc71b23608
+ ./test_suite
======= Starting MMS tests ========
Running 7 tests
. 245.797 s - diffusion
. 242.839 s - wave-1d
. 242.686 s - wave-1d-y
.   0.186 s - interpolation
.   5.716 s - derivatives
.   5.818 s - derivatives2
.   0.294 s - derivatives_flux

======= All tests passed in 743.34 seconds =======
dschwoerer commented 6 years ago

That is probably related to there being no network access, which makes calling gethostbyname slow; see https://github.com/rpm-software-management/mock/issues/164
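(A quick Python check to confirm that name resolution is the culprit:)

import socket
import time

# With no network access this lookup can block until a resolver timeout,
# unless the hostname is listed in /etc/hosts.
t0 = time.perf_counter()
socket.gethostbyname(socket.gethostname())
print("gethostbyname took %.3f s" % (time.perf_counter() - t0))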

PS: We can still try to make it faster :+1: