Closed · kostrzewa closed this 5 years ago
Shall I start downloading everything that is needed for the cA2.09.48 ensemble in Jülich and start testing the new code on juwels?
I believe @pittlerf has started this to some extent, but certainly doing some initial comparisons of the VdaggerV part is not a bad idea. Don't forget that there are some coding milestones to attain too before this can be considered for merging:
[ ] the case nb_vdaggerv_eigen_threads==1 should be treated in a special way and correspond to the old way of doing build_vdaggerv -> extract the relevant Eigen operations (including the displacements) into a separate (perhaps inline) function and add some logic to call it inside the t loop when nb_vdaggerv_eigen_threads==1 and outside of it otherwise (see the sketch after this list)
[ ] pure VdaggerV functionality should either be in a separate executable (probably preferable to avoid hackish input files) or triggerable with clear input flags for contract
[ ] it should be ascertained that the contractions do not slow down (much) due to EIGEN_DONT_PARALLELIZE not being set anymore
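Roughly what the first item describes could look like the following sketch. This is a minimal illustration only: the names (vdaggerv_kernel, V_of_t, Lt, the thread counts) are placeholders rather than the actual sLapH-contractions interfaces, and reading and computing are conflated for brevity.

```cpp
// Minimal sketch of the restructuring described in the first milestone above.
// All names here are illustrative placeholders, not the actual project code.
#include <Eigen/Dense>
#include <vector>

using Eigen::MatrixXcd;

// dense V^dagger * V for one time slice (displacements omitted for brevity)
inline MatrixXcd vdaggerv_kernel(const MatrixXcd &V) { return V.adjoint() * V; }

void build_vdaggerv(const std::vector<MatrixXcd> &V_of_t,
                    std::vector<MatrixXcd> &vdaggerv,
                    int nb_evec_read_threads, int nb_vdaggerv_eigen_threads) {
  const int Lt = static_cast<int>(V_of_t.size());
  if (nb_vdaggerv_eigen_threads == 1) {
    // "old" behaviour: single-threaded Eigen, trivially parallel over t
    Eigen::setNbThreads(1);
#pragma omp parallel for num_threads(nb_evec_read_threads)
    for (int t = 0; t < Lt; ++t)
      vdaggerv[t] = vdaggerv_kernel(V_of_t[t]);
  } else {
    // "new" behaviour: serial t loop, Eigen parallelizes each dense product
    Eigen::setNbThreads(nb_vdaggerv_eigen_threads);
    for (int t = 0; t < Lt; ++t)
      vdaggerv[t] = vdaggerv_kernel(V_of_t[t]);
  }
}
```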
Does using the old code correspond to nb_evec_read_threads=nb_vdaggerv_eigen_threads?
I don't quite understand what you mean, but if I guess what you're asking: currently it clearly doesn't. The old code would imply nb_evec_read_threads=${OMP_NUM_THREADS} and, via EIGEN_DONT_PARALLELIZE, nb_vdaggerv_eigen_threads=1 by definition, but of course ${OMP_NUM_THREADS} VdaggerV calls run in parallel.
We also need to talk about how you guys compile slaph-contractions on Juwels and the sbatch settings that you use...
cmake \
  -DCMAKE_C_COMPILER=icc \
  -DCMAKE_CXX_COMPILER=icpc \
  -DCMAKE_CXX_FLAGS_RELEASE='-fopenmp -O3 -mtune=haswell -march=haswell -g' \
  -DLIME_INCLUDE_DIRS=/p/home/jusers/pittler1/juwels/build/lime/include \
  -DLIME_LIBRARIES='-L/p/home/jusers/pittler1/juwels/build/lime/lib -llime' \
  /p/project/chbn28/hbn28d/code/sLaph_contractions_bartek/
1) GCCcore/.8.3.0 (H)  2) binutils/.2.32 (H)  3) StdEnv (H)  4) icc/.2019.3.199-GCC-8.3.0 (H)  5) ifort/.2019.3.199-GCC-8.3.0 (H)  6) Intel/2019.3.199-GCC-8.3.0
7) pscom/.Default (H)  8) numactl/2.0.12  9) nvidia/.418.40.04 (H,g)  10) CUDA/10.1.105 (g)  11) UCX/1.5.1  12) ParaStationMPI/5.2.2-1
13) zlib/.1.2.11 (H)  14) Szip/.2.1.1 (H)  15) HDF5/1.10.5  16) bzip2/.1.0.6 (H)  17) ncurses/.6.1 (H)  18) libreadline/.8.0 (H)
19) Tcl/8.6.9  20) SQLite/.3.27.2 (H)  21) expat/.2.2.6 (H)  22) libpng/.1.6.36 (H)  23) freetype/.2.10.0 (H)  24) gperf/.3.1 (H)
25) util-linux/.2.33.1 (H)  26) fontconfig/.2.13.1 (H)  27) X11/20190311  28) Tk/.8.6.9 (H)  29) GMP/6.1.2  30) XZ/.5.2.4 (H)
31) libxml2/.2.9.9 (H)  32) libxslt/.1.1.33 (H)  33) libffi/.3.2.1 (H)  34) libyaml/.0.2.2 (H)  35) Java/1.8  36) PostgreSQL/11.2
37) protobuf/.3.7.1 (H)  38) gflags/.2.2.2 (H)  39) libspatialindex/.1.9.0 (H)  40) NASM/.2.14.02 (H)  41) libjpeg-turbo/.2.0.2 (H)  42) Python/3.6.8
43) ICU/.64.1 (H)  44) Boost/1.69.0-Python-3.6.8  45) CMake/3.14.0  46) Bison/.3.3.2 (H)  47) flex/2.6.4  48) imkl/2019.3.199
49) Eigen/3.3.7  50) jscslurm/.17.11.12 (H,S)  51) jsctools/.0.1 (H,S)  52) .juwels-env (H)
Yes, and why compile with haswell on Juwels? (disregarding the fact that this way of specifying things for ICC is only a GCC-compatibility matter)
Should we compile for skylake?
I guess so, either -xSKYLAKE-AVX512 or -xCORE-AVX2, whichever is faster in the end.
However, I have some bad news. It appears that on Juwels, scaling with the number of VdaggerV threads basically breaks down at 8 threads. The only other thing I can imagine trying is to launch two executables (for two separate configs), one bound to each socket ("multiple program multiple data"), and then see if the scaling is better. Cross-socket communication is of course a major concern on Skylake and it might be that this is what we're seeing. Another avenue to explore is to use MKL via Eigen (https://eigen.tuxfamily.org/dox/TopicUsingIntelMKL.html).
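For the MKL route, the mechanism on the Eigen side is just a preprocessor definition plus linking against MKL. A minimal sketch (the matrix sizes are made up, and the exact link flags depend on the toolchain):

```cpp
// Minimal sketch of the MKL-via-Eigen option linked above: defining
// EIGEN_USE_MKL_ALL before any Eigen header routes Eigen's dense kernels
// through MKL (the binary must additionally be linked against MKL, e.g. via
// the Intel compiler's -mkl/-qmkl switch; details depend on the toolchain).
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>
#include <iostream>

int main() {
  const int nev = 120;  // made-up number of Laplacian eigenvectors
  Eigen::MatrixXcd V = Eigen::MatrixXcd::Random(3 * nev, nev);
  Eigen::MatrixXcd VdaggerV = V.adjoint() * V;  // large products go to MKL
  std::cout << VdaggerV.norm() << std::endl;
}
```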
yep, Matthias started a job with 16 reading threads and it seemed to be really slow; I had a job with 2 read threads and 32 eigen threads and it took around 3 hours to finish.
Well, it's about a factor of two faster than on lnode07, but of course one would like to improve upon this...
My strategy would be to find the number of read threads which is fastest when the number of eigen threads is 48. Do you agree with this?
No, my strategy would be:
1) test the "old" way of doing things and see how much faster it is than this (it probably will be since we are able to run with 16 threads apparently)
2) if the difference is small, then the "new" method should be explored more thoroughly: socket pinning to avoid cross-socket comms, MPMD and usage of MKL
Finally, one should note that if the cost for VdaggerV is not so large in total, then we should just go ahead with the calculation in the "old" way, archiving the VdaggerV for the future.
But then do we compare to nb_evec_read_threads=nb_vdaggerv_eigen_threads?
What?! The cost is totally dominated by the dense linear algebra (I/O takes only a few seconds per bunch of time slices). As a result, it's basically irrelevant what you set for nb_evec_read_threads, although 8 or 16 are good choices.
Finally, one should note that if the cost for VdaggerV is not so large in total, then we should just go ahead with the calculation in the "old" way, archiving the VdaggerV for the future.
I think we gain a lot when computing VdaggerV for all total momenta at the same time.
What?!
When I say the "old" way, I mean of course pre-computing VdaggerV and storing it.
Yes, that is what is meant: doing VdaggerV for all the total momenta, saving it, then reading it.
It would still be great to have this "new" method formalised in the code as an option, as long as it doesn't affect overall performance.
Btw, there's a simple optimisation which should be taken for VdaggerV without displacements: doing the accumulation over the colour degree of freedom before performing momentum projections.
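A rough sketch of what this optimisation amounts to, assuming the usual displacement-free definition of VdaggerV and purely illustrative names and data layout (not the actual code): the colour sum at each site does not depend on the momentum, so it can be accumulated once per site and then reused for every momentum projection.

```cpp
// Illustrative sketch only (names and layout are assumptions): accumulate the
// colour degree of freedom once per site, then apply the momentum phases.
#include <Eigen/Dense>
#include <complex>
#include <vector>

using Eigen::MatrixXcd;

// V[x]: 3 x nev block of eigenvectors at site x (colour runs over the rows)
// phases[p][x] = exp(i p.x); result[p]: nev x nev VdaggerV at momentum p
void vdaggerv_momenta(const std::vector<MatrixXcd> &V,
                      const std::vector<std::vector<std::complex<double>>> &phases,
                      std::vector<MatrixXcd> &result) {
  const auto nev = V[0].cols();
  for (auto &r : result) r.setZero(nev, nev);
  for (std::size_t x = 0; x < V.size(); ++x) {
    // the colour sum happens here, once per site ...
    const MatrixXcd site_block = V[x].adjoint() * V[x];
    // ... and only the phase-weighted accumulation depends on the momentum
    for (std::size_t p = 0; p < result.size(); ++p)
      result[p] += phases[p][x] * site_block;
  }
}
```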
yes, Marcus suggested that as well
So, now I will make a scan with the old code with 2, 4, 8, 16, 24, 32 threads and look for the optimum.
I have made some tests and concluded that with 32 eigen threads we have the optimal settings. I will try running two at the same time with read_threads=1.
The checks are not passing now. You pass vdaggerv by value, that cannot work.
I'm cleaning up some things so I'll invalidate your push in a second
Okay, I think I've fixed the behaviour with N reading threads and a single eigen thread to be correct (this was not covered by the tests, so it did not show up). @matfischer can you please confirm that the "old" way of running things now works again (nb_vdaggerv_eigen_threads=1, nb_evec_read_threads=8) for the 32c64 lattice on qbig? Note that it will be a little slower than it was before because the VdaggerV check is included by default now, but this is a good thing.
To me it seems that it works fine now, but I don't have numbers at hand to compare.
I have made some tests and concluded that with 32 eigen threads we have the optimal settings. I will try running two at the same time with read_threads=1.
Yeah, on Juwels for the 48c96 there is a "performance resonance" at nb_vdaggerv_eigen_threads=16 and nb_vdaggerv_eigen_threads=32 (with 32 faster than 16), where things seem to fit well for Eigen to parallelize VdaggerV. I estimate a VdaggerV time of around 3 hours per config for this.
@pittlerf @matfischer Do you also have numbers already for the 48c96 on Juwels and how the "old" trivially parallel way (with as many threads as fit into memory) and the "new" implementation compare?
I'll decouple now. @martin-ueding if you find some time when you're back, a scan of the changes would be much appreciated and a 'yay' or 'nay' on the "only_vdaggerv_compute_save" mode for handling_vdaggerv, abusing contract as a VdaggerV generator. The thing that I'm not happy about here is that the input file for this is kind of hackish, having to specify an operator and a correlator instead of simply passing some max momentum shell number.
@pittlerf @matfischer Do you also have numbers already for the 48c96 on Juwels and how the "old" trivially parallel way (with as many threads as fit into memory) and the "new" implementation compare?
No, so far I have no numbers for the "old" way on juwels. For the "new" way I wasn't even able to receive any kind of output. I used 16 reading threads and 32 eigen threads. So the program aborted after 3 h.
Okay, I think I've fixed the behaviour with N reading threads and a single eigen thread to be correct (this was not covered by the tests, so it did not show up). @matfischer can you please confirm that the "old" way of running things now works again (nb_vdaggerv_eigen_threads=1, nb_evec_read_threads=8) for the 32c64 lattice on qbig? Note that it will be a little slower than it was before because the VdaggerV check is included by default now, but this is a good thing.
So, you mean testing the new contraction executable just for vdaggerv creation where I use not all operators?
No, so far I have no numbers for the "old" way on juwels. For the "new" way I wasn't even able to receive any kind of output. I used 16 reading threads and 32 eigen threads. So the program aborted after 3 h.
There is a time measurement for each phase (bunch of time slices) which allows you to easily compute how long the entire computation will take. What do you mean "any kind of output"?
So, you mean testing the new contraction executable just for vdaggerv creation where I use not all operators?
Yes, setting nb_vdaggerv_eigen_threads=1 and nb_evec_read_threads=8. The time for VdaggerV creation should be compared to the time that the old code took. Note that the VdaggerV check should be disabled here for fairness because the old code did not perform it (just comment it out of the kernel function for this test).
Finally, it is imperative that also the contraction time itself is checked for any performance regressions as a result of removing EIGEN_DONT_PARALLELIZE. If we see a slow-down, this will mean that one will need to compile two versions of contract for production purposes: one for VdaggerV and one for running the actual contractions. As a result, whether or not EIGEN_DONT_PARALLELIZE is defined should be elevated to a CMake parameter.
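If EIGEN_DONT_PARALLELIZE does get promoted to a CMake option, a trivial standalone check like the following (not project code, just an illustration) distinguishes the two builds at runtime; Eigen::nbThreads() is part of Eigen's public API.

```cpp
// Standalone illustration (not slaph-contractions code): report which of the
// two hypothetical builds of `contract` this translation unit corresponds to.
#include <Eigen/Core>
#include <iostream>

int main() {
#ifdef EIGEN_DONT_PARALLELIZE
  std::cout << "Eigen internal parallelization disabled"
               " (old behaviour for the contractions)\n";
#else
  std::cout << "Eigen internal parallelization enabled, nbThreads = "
            << Eigen::nbThreads() << "\n";
#endif
}
```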
No, so far I have no numbers for the "old" way on juwels.
Okay, then you should produce some.
There is a time measurement for each phase (bunch of time slices) which allows you to easily compute how long the entire computation will take. What do you mean "any kind of output"?
No vdaggerv object was written to the output folder and the output file didn't mention such a process either. Only reading threads have been written. The outer loop took 2300 sec and the output file does not specify what was performed within that outer loop. The vdaggerv check took 85 sec.
No vdaggerv object was written to the output folder and the output file didn't mention such a process either.
Did you read the source code? The "operator" files are written at the very end.
Only reading threads have been written.
I don't understand what this means.
The outer loop took 2300 sec and
How many of the phases ran in those 2300 seconds then?
the output file does not specify what was performed within that outer loop.
feel free to add some verbosity if you'd like (suitably wrapped with if( gd.verbose == 1 ))
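As a self-contained illustration of the kind of verbosity meant here (the struct and variable names are placeholders; only the gd.verbose == 1 guard comes from the actual code):

```cpp
// Sketch only: time one "phase" of work and report it when verbosity is on.
// GlobalDataSketch stands in for the real global data object.
#include <chrono>
#include <iostream>

struct GlobalDataSketch { int verbose = 1; };

int main() {
  GlobalDataSketch gd;
  for (int phase = 0; phase < 3; ++phase) {
    const auto start = std::chrono::steady_clock::now();
    // ... VdaggerV work for this bunch of time slices would go here ...
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    if (gd.verbose == 1)
      std::cout << "\tVdaggerV: phase " << phase << " took "
                << elapsed.count() << " s" << std::endl;
  }
}
```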
The vdaggerv check took 85sec.
That matches what I saw as well.
How many of the phases ran in those 2300 seconds then?
One phase took 2300 sec...
After 3/6 Phases the program aborted due to the wall time limit.
One phase took 2300 sec...
that seems a bit long (I would get 1800 seconds if I had used 16 reading threads, see further below).
After 3/6 Phases the program aborted due to the wall time limit.
You should run in the batch partition then rather than in the devel one, to at least have a single job run from start to finish. (not necessary for testing, of course)
Please take a look at
/p/home/jusers/kostrzewa2/juwels/build/juwels/stage-2019a/skylake-avx512/intel_mpi_2019-intel_2019/slaph-contractions
to see how I compile the code. Note also the load_modules.sh file, which you can source (source load_modules.sh) before running ./do_cmake.sh.
See /p/scratch/chbn28/hbn288/contractions/nf2/cA2a.09.48 for my test jobs, in particular /p/scratch/chbn28/hbn288/contractions/nf2/cA2a.09.48/jobs/nb_evec_read_threads8-nb_vdaggerv_eigen_threads32/outputs where one phase (for 8 time slices) takes ~900 seconds.
Is there a way that I can launch your test with 16 reading threads and 32 eigen threads for comparison? As far as I have seen, you haven't performed this test yet.
copy the job script generator and run it?
No, I mean: how can I use your executable? I cannot execute your script, nor can I access your home folder where I could see how you compiled it.
I see. I only have symlinks in my home directory and I had forgotten that the permissions had changed so drastically. Look here instead:
/p/project/chbn28/kostrzewa2/build/juwels/stage-2019a/skylake-avx512/intel_mpi_2019-intel_2019/slaph-contractions
which is the path that I link to from my home directory (which you can't know, of course).
Note that there will not be much of a difference between different numbers of reading threads above 4, say, as VdaggerV is strongly dominated by the linear algebra. I would thus expect each phase to take 1800 seconds give or take 100 or so when the number of time slices per phase is 16.
Okay, I think I've fixed the behaviour with N reading threads and a single eigen thread to be correct (this was not covered by the tests, so it did not show up). @matfischer can you please confirm that the "old" way of running things now works again (nb_vdaggerv_eigen_threads=1, nb_evec_read_threads=8) for the 32c64 lattice on qbig? Note that it will be a little slower than it was before because the VdaggerV check is included by default now, but this is a good thing.
The "old" way works on qbig/lnode07. Tested nb_evec_read_threads=8
and nb_vdaggerv_eigen_threads=1
. The vdaggerv
-part takes 556sec. When I set nb_evec_read_threads=1
and nb_vdaggerv_eigen_threads=8
then it takes 1000sec. However, memory consumption is smaller with the latter approach (not unexpected) which might be beneficial for large lattices.
The "old" way works on qbig/lnode07. Tested nb_evec_read_threads=8 and nb_vdaggerv_eigen_threads=1. The vdaggerv-part takes 556sec. When I set nb_evec_read_threads=1 and nb_vdaggerv_eigen_threads=8 then it takes 1000sec. However, memory consumption is smaller with the latter approach (not unexpected) which might be beneficial for large lattices.
In this, did you comment out the VdaggerV check? If yes, how does this compare to the original implementation (which did not do the check)?
No, I was primarily interested in a timing run for the code with the check. Unfortunately I overwrote the old output file on qbig as, back then, we were just interested in the question of whether we were using all cores. Timing data for the old code I only have available on juwels. Shall I perform timing runs for the old (current) master branch on lnode07?
Besides that, I was interested in the memory usage. We use around 4 GB for nb_evec_read_threads=1 and nb_vdaggerv_eigen_threads=8 at cA2.60.32. The "old" way with the new code costs ~10 GB.
No, I was primarily interested in a timing run for the code with the check. Unfortunately I overwrote the old output file on qbig as, back then, we were just interested in the question of whether we were using all cores. Timing data for the old code I only have available on juwels. Shall I perform timing runs for the old (current) master branch on lnode07?
That would be great, yes.
For the code in the present branch, it would be good if you could do another timing where you comment out the VdaggerV correctness check, such that the "old" and "new" codes are exactly comparable in terms of time for VdaggerV production.
So for the "new" code I performed timings yesterday night without the VdaggerV check.
Time with check for read_threads=8 and eigen_threads=1: 555 sec., without: 523 sec., memory: ~10 GB.
Time with check for read_threads=1 and eigen_threads=8: 1025 sec., without check: 984 sec., memory: ~4 GB.
That means the check costs ~5% wall time.
Then I will compare the current master branch to nb_evec_read_threads without the check. After that I will start timings for cA2.09.48 on juwels with the "new" code (varying eigen/read threads) and check whether the ~5% wall time cost of the VdaggerV check can be confirmed on juwels as well. Any objections concerning my plan?
That means the check costs ~5% wall time.
Okay, this is more or less as expected then, thanks.
So for the "new" code I performed timings yesterday night without the VdaggerV check. Time with check for read_threads=8 and eigen_threads=1: 555 sec., without: 523 sec., memory: ~10 GB.
As for the total time, it seems that we have lost quite a bit compared to the numbers reported by @pittlerf above, but as I said above, I believe that @pittlerf was not running on lnode07 when performing the test.
Then I will compare the current master branch to nb_evec_read_threads without the check.
Yes, that's the logical next step. When I performed this check a few days ago, I obtained (with the master branch):
[ Finish] Eigenvector and Gauge I/O. Duration: 492.8026589742 seconds. Threads: 8
and it would be great if you could confirm this. Given that you measured 523 seconds, that would be a completely acceptable loss of performance compared to the gained flexibility.
Before then moving on to 48c96, could you please also recompile the new code with EIGEN_DONT_PARALLELIZE commented back into CMakeLists.txt and run with nb_evec_read_threads=8, nb_vdaggerv_eigen_threads=1 (make sure to delete CMakeCache.txt and CMakeFiles in the build directory)? I would like to understand where exactly the ~30 seconds are lost or if we're simply seeing a performance fluctuation (completely expected at this order of magnitude).
Second to last: it is very important that we also have a comparison of the time for the contractions themselves between the old and the new code, due to the removal of EIGEN_DONT_PARALLELIZE.
Finally: even more important than performance is correctness, and so it's important to compare the final correlation functions in the three cases:
1) old code running with 8 threads
2) new code running with nb_evec_read_threads=8, nb_vdaggerv_eigen_threads=8
3) new code running with nb_evec_read_threads=8, nb_vdaggerv_eigen_threads=1
as (2) and (3) are not covered by the integration tests. This should complete our understanding of what we have done here.
Finally: even more important than performance is correctness, and so it's important to compare the final correlation functions in the three cases:
Note that when I say this, I don't imply any analysis. A simple side-by-side diff of the h5dump output will be more than sufficient to make sure that things still work as expected.
…ing of eigensystems and the dense matrix multiply
@matfischer @pittlerf can you please test this?