dschwoerer closed this issue 6 years ago
This can also be true for the tests with OpenMP. I've been experimenting with Travis to find out why and when the OpenMP cases take too long.
Running the test-delp2 case, which runs on 1, 2 and 4 processors, with and without OpenMP, is perhaps interesting. I modified the output to write the real/user/sys time for each of the six cases run in test-delp2 (the three processor counts for two mesh counts).
The pure MPI case shows a substantial jump in the real time taken but no real change in the user time. The hybrid case with two threads shows a very large jump in both the real and user time when using more than one MPI process. The test-delp2 case took 5 s without OpenMP and 5.5 minutes with OpenMP.
It seems that yielding when idle helps a bit, but it is still not really good enough.
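For reference, a minimal sketch of how one of these configurations could be timed by hand; the executable name `./delp2` is an assumption about the local build, not taken from the test scripts:

```sh
# Time one test-delp2 configuration directly, pure MPI vs. hybrid MPI+OpenMP.
cd tests/integrated/test-delp2

export OMP_NUM_THREADS=1      # pure MPI: one thread per process
time mpirun -np 2 ./delp2

export OMP_NUM_THREADS=2      # hybrid: requires BOUT++ configured with --enable-openmp
time mpirun -np 2 ./delp2
```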
I ran the blob2d.py example with nout=1 on a local virtual machine.
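(A sketch of the kind of loop that produces the numbers below; running `blob2d.py` directly under `mpirun` and passing `nout=1` on the command line are assumptions, not necessarily the exact invocation used:)

```sh
# For each MPI process count, run the example once and keep only the
# real/user/sys summary printed by the shell's `time` keyword.
for np in 1 2 4; do
    echo "$np MPI processes"
    { time mpirun -np "$np" python3 blob2d.py nout=1 ; } 2>&1 | tail -n 3
done
```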
1 MPI process, 2 cores
real 0m1.621s
user 0m1.515s
sys 0m0.163s
2 MPI processes, 2 cores
real 0m1.092s
user 0m1.922s
sys 0m0.175s
4 MPI processes, 2 cores
real 0m9.845s
user 0m17.644s
sys 0m1.871s
1 MPI process, 1 core
real 0m1.627s
user 0m1.486s
sys 0m0.083s
2 MPI processes, 1 core
real 0m9.097s
user 0m8.063s
sys 0m0.971s
4 MPI processes, 1 core
real 0m33.167s
user 0m30.068s
sys 0m2.985s
I didn't recompile with OpenMP enabled, but nevertheless I never saw a case where real was much larger than user ...
I thought Travis gives each build 2 cores - but maybe they do some smarter load balancing?
Is the `--mca` option an OpenMPI-only thing? MPICH complains about an unrecognised argument `mca`.
Yes, the `--mca` option is OpenMPI-only, unfortunately. From what I've read, the busy polling is baked into the default "nemesis" communication channel used by MPICH, and you have to rebuild/configure MPICH to enable a different channel that doesn't force this.
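For concreteness, this is the kind of invocation being discussed; `mpi_yield_when_idle` is the OpenMPI MCA parameter for yielding when idle, and the MPICH configure line is only a sketch of the build-time change that would be needed:

```sh
# OpenMPI: ask idle ranks to yield the CPU instead of busy-polling.
mpirun --mca mpi_yield_when_idle 1 -np 4 ./blob2d

# MPICH has no runtime switch for this; the polling behaviour comes from the
# communication device chosen when MPICH itself is built, roughly:
#   ./configure --with-device=ch3:sock ...
```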
Travis has two different types of build system, container based and VM based -- the container based version gives you exactly two cores and boots faster, but unfortunately it seems to end up being slower, so we currently use the VM based approach, which gives "~2" cores in burst mode (i.e. not allocated exclusive usage of two cores, I believe).
This branch uses the container based approach (it just requires removing one line from the yml file) but is quite a bit slower than the VM approach.
(If I recall correctly, when I looked at this previously the slowdown in the container was related to the use of the AUFS file system, which apparently has some issues.)
It might help to move what we can into unit tests; then we at least don't have to pay the cost of setting up/tearing down full physics models each time. I have a basic example of MPI unit tests in the mpi_tests branch.
Or rewrite tests using the Python interface. The old test, with C++ and starting/stopping BOUT++:
. 6062.005 s - interpolation
The new test with the Python interface takes around 240 ms ... To be fair, the new test is without MPI - the GlobalField could be useful for this, but I haven't done that yet ...
The above timings are from mock - without mock, the time only goes from 7.6 s to 230 ms ...
Overview of where we can improve:
======= Making integrated tests ========
Making 21 tests
.....................
======= All make tests passed in 33.73 seconds =======
+ ./test_suite
======= Starting integrated tests ========
Running 24 tests
. 121.223 s - test-io
. 0.270 s - test-command-args
. 121.035 s - test-fieldfactory
. 201.116 s - test-restarting
. 123.258 s - test-laplace
. 482.710 s - test-cyclic
. 362.147 s - test-invpar
. 161.132 s - test-smooth
. 80.504 s - test-region-iterator
. 121.062 s - test-gyro
. 274.479 s - test-delp2
. 40.682 s - test-vec
. 40.401 s - test-griddata
. 120.601 s - test-dataiterator
. 120.678 s - test-dataiterator2
. 0.596 s - test-fieldgroup
. 166.693 s - test-initial
. 160.936 s - test-stopCheck
. 40.324 s - test-subdir
. 0.319 s - test-aiolos
. 93.473 s - test-fci-slab
. 124.938 s - test-drift-instability
. 91.766 s - test-interchange-instability
. 0.026 s - test-code-style
======= All tests passed in 3050.40 seconds =======
+ popd
~/build/BUILD/BOUT-dev-95745b97af4687d51de9400ace7537fc71b23608
+ pushd build_mpich/tests/MMS
~/build/BUILD/BOUT-dev-95745b97af4687d51de9400ace7537fc71b23608/build_mpich/tests/MMS ~/build/BUILD/BOUT-dev-95745b97af4687d51de9400ace7537fc71b23608
+ ./test_suite
======= Starting MMS tests ========
Running 7 tests
. 245.797 s - diffusion
. 242.839 s - wave-1d
. 242.686 s - wave-1d-y
. 0.186 s - interpolation
. 5.716 s - derivatives
. 5.818 s - derivatives2
. 0.294 s - derivatives_flux
======= All tests passed in 743.34 seconds =======
That is probably related to the lack of network access, which makes calling `gethostbyname` slow.
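A common workaround for that in a chroot without network access (assuming this is what mock ends up hitting) is to make the build host's name resolve locally:

```sh
# Add the host's own name to /etc/hosts inside the chroot so that
# gethostbyname() returns immediately instead of waiting for DNS.
echo "127.0.0.1   $(hostname)" >> /etc/hosts
```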
Replaced by https://github.com/rpm-software-management/mock/issues/164
PS: We can still try to make it faster :+1:
The test suite can be extremely slow. Example:
The reason is that MPICH, and also OpenMPI, perform extremely badly in the case of oversubscription. Should/could we reduce the number of parallel tests?
On a parallel machine the whole thing takes less than 5 minutes, but on the busy server in the example it took about 3 hours.
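A quick way to see the oversubscription penalty on a given machine (the test binary name here is just a placeholder):

```sh
# Compare a run that matches the core count with one using twice as many
# MPI processes; newer OpenMPI also needs --oversubscribe for the second run.
NCORES=$(nproc)
time mpirun -np "$NCORES"        ./test_delp2
time mpirun -np $((2 * NCORES))  ./test_delp2
```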
PS: Sorry for pressing enter too early