HISKP-LQCD / sLapH-contractions

Stochastic LapH contraction program
GNU General Public License v3.0

support the use of different numbers of threads for the parallel read… #98

Closed kostrzewa closed 5 years ago

kostrzewa commented 5 years ago

…ing of eigensystems and the dense matrix multiply

@matfischer @pittlerf can you please test this?

kostrzewa commented 5 years ago

Shall I start to download everything which is needed for the cA2.09.48 ensemble in Jülich and start testing the new code on juwels?

I believe @pittlerf has started this to some extent, but certainly doing some initial comparisons of the VdaggerV part is not a bad idea. Don't forget that there are some coding milestones to attain too before this can be considered for merging:

pittlerf commented 5 years ago

Does using the old code correspond to nb_evec_read_threads=nb_vdaggerv_eigen_threads?

kostrzewa commented 5 years ago

Does using the old code correspond to nb_evec_read_threads=nb_vdaggerv_eigen_threads?

I don't quite understand what you mean, but if I guess what you're asking: currently it clearly doesn't. The old code would imply

nb_evec_read_threads=${OMP_NUM_THREADS}

and, via EIGEN_DONT_PARALLELIZE, nb_vdaggerv_eigen_threads=1 by definition, but of course ${OMP_NUM_THREADS} VdaggerV calls run in parallel.
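For reference, the two settings being contrasted here map onto the input file roughly as follows (a sketch only: the parameter names are the ones used in this thread, but the surrounding infile syntax is assumed, not taken from the code):

```
# hypothetical infile excerpt -- syntax assumed
nb_evec_read_threads      = 8   # parallel eigenvector reading (old code: ${OMP_NUM_THREADS})
nb_vdaggerv_eigen_threads = 1   # Eigen threads per VdaggerV multiply (old code: fixed to 1)
```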

kostrzewa commented 5 years ago

We also need to talk about how you guys compile slaph-contractions on Juwels and the sbatch settings that you use...

pittlerf commented 5 years ago

We also need to talk about how you guys compile slaph-contractions on Juwels and the sbatch settings that you use...

cmake \
  -DCMAKE_C_COMPILER=icc \
  -DCMAKE_CXX_COMPILER=icpc \
  -DCMAKE_CXX_FLAGS_RELEASE='-fopenmp -O3 -mtune=haswell -march=haswell -g' \
  -DLIME_INCLUDE_DIRS=/p/home/jusers/pittler1/juwels/build/lime/include \
  -DLIME_LIBRARIES='-L/p/home/jusers/pittler1/juwels/build/lime/lib -llime' \
  /p/project/chbn28/hbn28d/code/sLaph_contractions_bartek/

 1) GCCcore/.8.3.0 (H)
 2) binutils/.2.32 (H)
 3) StdEnv (H)
 4) icc/.2019.3.199-GCC-8.3.0 (H)
 5) ifort/.2019.3.199-GCC-8.3.0 (H)
 6) Intel/2019.3.199-GCC-8.3.0
 7) pscom/.Default (H)
 8) numactl/2.0.12
 9) nvidia/.418.40.04 (H,g)
10) CUDA/10.1.105 (g)
11) UCX/1.5.1
12) ParaStationMPI/5.2.2-1
13) zlib/.1.2.11 (H)
14) Szip/.2.1.1 (H)
15) HDF5/1.10.5
16) bzip2/.1.0.6 (H)
17) ncurses/.6.1 (H)
18) libreadline/.8.0 (H)
19) Tcl/8.6.9
20) SQLite/.3.27.2 (H)
21) expat/.2.2.6 (H)
22) libpng/.1.6.36 (H)
23) freetype/.2.10.0 (H)
24) gperf/.3.1 (H)
25) util-linux/.2.33.1 (H)
26) fontconfig/.2.13.1 (H)
27) X11/20190311
28) Tk/.8.6.9 (H)
29) GMP/6.1.2
30) XZ/.5.2.4 (H)
31) libxml2/.2.9.9 (H)
32) libxslt/.1.1.33 (H)
33) libffi/.3.2.1 (H)
34) libyaml/.0.2.2 (H)
35) Java/1.8
36) PostgreSQL/11.2
37) protobuf/.3.7.1 (H)
38) gflags/.2.2.2 (H)
39) libspatialindex/.1.9.0 (H)
40) NASM/.2.14.02 (H)
41) libjpeg-turbo/.2.0.2 (H)
42) Python/3.6.8
43) ICU/.64.1 (H)
44) Boost/1.69.0-Python-3.6.8
45) CMake/3.14.0
46) Bison/.3.3.2 (H)
47) flex/2.6.4
48) imkl/2019.3.199
49) Eigen/3.3.7
50) jscslurm/.17.11.12 (H,S)
51) jsctools/.0.1 (H,S)
52) .juwels-env (H)

kostrzewa commented 5 years ago

Yes, and why compile with haswell on Juwels? (disregarding the fact that this way of specifying things for ICC is only a GCC-compatibility matter)

pittlerf commented 5 years ago

Yes, and why compile with haswell on Juwels? (disregarding the fact that this way of specifying things for ICC is only a GCC-compatibility matter)

Should we compile for skylake?

kostrzewa commented 5 years ago

Should we compile for skylake?

I guess so, either -xSKYLAKE-AVX512 or -xCORE-AVX2, whatever is faster in the end.
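Concretely, this would amount to swapping the architecture flags in the cmake call quoted above, along these lines (a sketch: everything except the -x flag is taken from the quoted command, and which -x variant is faster needs to be measured):

```shell
cmake \
  -DCMAKE_C_COMPILER=icc \
  -DCMAKE_CXX_COMPILER=icpc \
  -DCMAKE_CXX_FLAGS_RELEASE='-fopenmp -O3 -xCORE-AVX2 -g' \
  -DLIME_INCLUDE_DIRS=/p/home/jusers/pittler1/juwels/build/lime/include \
  -DLIME_LIBRARIES='-L/p/home/jusers/pittler1/juwels/build/lime/lib -llime' \
  /p/project/chbn28/hbn28d/code/sLaph_contractions_bartek/
```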

However, I have some bad news. It appears that on Juwels, scaling with the number of VdaggerV threads basically breaks down at 8 threads. The only other thing I can imagine trying is to launch two executables (for two separate configs), one bound to each socket ("multiple program multiple data"), and then see if the scaling is better. Cross-socket communication is of course a major concern on Skylake and it might be that this is what we're seeing. Another avenue to explore is to use MKL via Eigen (https://eigen.tuxfamily.org/dox/TopicUsingIntelMKL.html).
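The MPMD idea could be tried along these lines (a sketch only; the binary name, input-file names, and binding flags are assumptions to be checked against the Juwels Slurm documentation):

```shell
# mpmd.conf -- one contract instance per socket, each on its own gauge config
# 0 ./contract -i cnfg_A.ini
# 1 ./contract -i cnfg_B.ini

# launch with one task bound to each socket of a two-socket Skylake node
srun --ntasks=2 --cpu-bind=sockets --multi-prog mpmd.conf
```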

pittlerf commented 5 years ago

Should we compile for skylake?

I guess so, either -xSKYLAKE-AVX512 or -xCORE-AVX2, whatever is faster in the end.

However, I have some bad news. It appears that on Juwels, scaling with the number of VdaggerV threads basically breaks down at 8 threads. The only other thing I can imagine trying is to launch two executables (for two separate configs), one bound to each socket ("multiple program multiple data"), and then see if the scaling is better. Cross-socket communication is of course a major concern on Skylake and it might be that this is what we're seeing. Another avenue to explore is to use MKL via Eigen (https://eigen.tuxfamily.org/dox/TopicUsingIntelMKL.html).

yep, Matthias started a job with 16 reading threads and it seemed to be really slow; I had a job with 2 read and 32 eigen threads and it took around 3 hours to finish

kostrzewa commented 5 years ago

yep, Matthias started a job with 16 reading threads and it seemed to be really slow; I had a job with 2 read and 32 eigen threads and it took around 3 hours to finish

Well, it's about a factor of two faster than on lnode07, but of course one would like to improve upon this...

pittlerf commented 5 years ago

yep, Matthias started a job with 16 reading threads and it seemed to be really slow; I had a job with 2 read and 32 eigen threads and it took around 3 hours to finish

Well, it's about a factor of two faster than on lnode07, but of course one would like to improve upon this...

My strategy would be to find the number of read threads that is fastest when the number of eigen threads is 48. Do you agree with this?

kostrzewa commented 5 years ago

My strategy would be to find the number of read threads that is fastest when the number of eigen threads is 48. Do you agree with this?

No, my strategy would be:

  1. test the "old" way of doing things and see how much faster it is than this (it probably will be since we are able to run with 16 threads apparently)
  2. if the difference is small, then the "new" method should be explored more thoroughly: socket pinning to avoid cross-socket comms, MPMD and usage of MKL

kostrzewa commented 5 years ago

Finally, one should note that if the cost for VdaggerV is not so large in total, then we should just go ahead with the calculation in the "old" way, archiving the VdaggerV for the future.

pittlerf commented 5 years ago

My strategy would be to find the number of read threads that is fastest when the number of eigen threads is 48. Do you agree with this?

No, my strategy would be:

  1. test the "old" way of doing things and see how much faster it is than this (it probably will be since we are able to run with 16 threads apparently)
  2. if the difference is small, then the "new" method should be explored more thoroughly: socket pinning to avoid cross-socket comms, MPMD and usage of MKL

But then do we compare to nb_evec_read_threads=nb_vdaggerv_eigen_threads?

kostrzewa commented 5 years ago

But then do we compare to nb_evec_read_threads=nb_vdaggerv_eigen_threads?

What?! The cost is totally dominated by the dense linear algebra (I/O takes only a few seconds per bunch of time slices). As a result, it's basically irrelevant what you set for nb_evec_read_threads, although 8 or 16 are good choices.

pittlerf commented 5 years ago

Finally, one should note that if the cost for VdaggerV is not so large in total, then we should just go ahead with the calculation in the "old" way, archiving the VdaggerV for the future.

I think we gain a lot when computing VdaggerV for all total momenta at the same time.

kostrzewa commented 5 years ago

I think we gain a lot when computing VdaggerV for all total momenta at the same time.

What?!

kostrzewa commented 5 years ago

When I say the "old" way, I mean of course pre-computing VdaggerV and storing it.

pittlerf commented 5 years ago

I think we gain a lot when computing VdaggerV for all total momenta at the same time.

What?!

Yes, that is what is meant: doing VdaggerV for all the total momenta, saving it, and then reading it back.

kostrzewa commented 5 years ago

It would still be great to have this "new" method formalised in the code as an option, as long as it doesn't affect overall performance.

kostrzewa commented 5 years ago

Btw, there's a simple optimisation which should be taken for VdaggerV without displacements: doing the accumulation over the colour degree of freedom before performing momentum projections.

pittlerf commented 5 years ago

Btw, there's a simple optimisation which should be taken for VdaggerV without displacements: doing the accumulation over the colour degree of freedom before performing momentum projections.

yes, Marcus suggested that as well

pittlerf commented 5 years ago

Btw, there's a simple optimisation which should be taken for VdaggerV without displacements: doing the accumulation over the colour degree of freedom before performing momentum projections.

yes, Marcus suggested that as well

So now I will make a scan with the old code with threads 2, 4, 8, 16, 24, 32 and look for the optimum.

pittlerf commented 5 years ago

Btw, there's a simple optimisation which should be taken for VdaggerV without displacements: doing the accumulation over the colour degree of freedom before performing momentum projections.

yes, Marcus suggested that as well

So now I will make a scan with the old code with threads 2, 4, 8, 16, 24, 32 and look for the optimum.

I have made some tests and concluded that 32 eigen threads is the optimal setting. I will try running two at the same time with read_threads=1.

kostrzewa commented 5 years ago

The checks are not passing now. You pass vdaggerv by value; that cannot work.

kostrzewa commented 5 years ago

I'm cleaning up some things so I'll invalidate your push in a second

kostrzewa commented 5 years ago

Okay, I think I've fixed the behaviour with N reading threads and a single eigen thread to be correct (this was not covered by the tests, so it did not show up). @matfischer can you please confirm that the "old" way of running things now works again (nb_vdaggerv_eigen_threads=1, nb_evec_read_threads=8) for the 32c64 lattice on qbig? Note that it will be a little slower than it was before because the VdaggerV check is included by default now, but this is a good thing.

kostrzewa commented 5 years ago

To me it seems that it works fine now, but I don't have numbers at hand to compare.

I have made some tests and concluded that 32 eigen threads is the optimal setting. I will try running two at the same time with read_threads=1.

Yeah, on Juwels for the 48c96 there is a "performance resonance" at nb_vdaggerv_eigen_threads=16 and nb_vdaggerv_eigen_threads=32 (with 32 faster than 16), where things seem to fit well enough for Eigen to parallelize VdaggerV efficiently. I estimate a VdaggerV time of around 3 hours per config for this.

@pittlerf @matfischer Do you also have numbers already for the 48c96 on Juwels and how the "old" trivially parallel way (with as many threads as fit into memory) and the "new" implementation compare?

kostrzewa commented 5 years ago

I'll decouple now. @martin-ueding if you find some time when you're back, a scan of the changes would be much appreciated and a 'yay' or 'nay' on the "only_vdaggerv_compute_save" mode for handling_vdaggerv, abusing contract as a VdaggerV generator. The thing that I'm not happy about here is that the input file for this is kind of hackish, having to specify an operator and a correlator instead of simply passing some max momentum shell number.

matfischer commented 5 years ago

@pittlerf @matfischer Do you also have numbers already for the 48c96 on Juwels and how the "old" trivially parallel way (with as many threads as fit into memory) and the "new" implementation compare?

No, so far I have no numbers for the "old" way on juwels. For the "new" way I wasn't even able to get any kind of output. I used 16 reading threads and 32 eigen threads, and the program aborted after 3 h.

matfischer commented 5 years ago

Okay, I think I've fixed the behaviour with N reading threads and a single eigen thread to be correct (this was not covered by the tests, so it did not show up). @matfischer can you please confirm that the "old" way of running things now works again (nb_vdaggerv_eigen_threads=1, nb_evec_read_threads=8) for the 32c64 lattice on qbig? Note that it will be a little slower than it was before because the VdaggerV check is included by default now, but this is a good thing.

So, you mean testing the new contraction executable just for vdaggerv creation, where I don't use all operators?

kostrzewa commented 5 years ago

No, so far I have no numbers for the "old" way on juwels. For the "new" way I wasn't even able to get any kind of output. I used 16 reading threads and 32 eigen threads, and the program aborted after 3 h.

There is a time measurement for each phase (bunch of time slices) which allows you to easily compute how long the entire computation will take. What do you mean by "any kind of output"?

So, you mean testing the new contraction executable just for vdaggerv creation, where I don't use all operators?

Yes, setting nb_vdaggerv_eigen_threads=1 and nb_evec_read_threads=8. The time for VdaggerV creation should be compared to the time that the old code took. Note that the VdaggerV check should be disabled here for fairness because the old code did not perform it (just comment it out of the kernel function for this test)

Finally, it is imperative that the contraction time itself is also checked for any performance regressions as a result of removing EIGEN_DONT_PARALLELIZE. If we see a slow-down, this will mean that one will need to compile two versions of contract for production purposes: one for VdaggerV and one for running the actual contractions. As a result, whether or not EIGEN_DONT_PARALLELIZE is defined should be elevated to a CMake parameter.
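A minimal way to expose this would be something like the following (a sketch only; the option name is made up and the project's CMakeLists.txt may prefer target_compile_definitions on the contract target):

```cmake
# Hypothetical CMake option controlling Eigen's internal threading.
option(SLAPH_EIGEN_DONT_PARALLELIZE
       "Define EIGEN_DONT_PARALLELIZE to keep Eigen single-threaded" OFF)
if(SLAPH_EIGEN_DONT_PARALLELIZE)
  add_compile_definitions(EIGEN_DONT_PARALLELIZE)
endif()
```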

kostrzewa commented 5 years ago

No, so far I have no numbers for the "old" way on juwels.

Okay, then you should produce some.

matfischer commented 5 years ago

There is a time measurement for each phase (bunch of time slices) which allows you to easily compute how long the entire computation will take. What do you mean "any kind of output"?

No vdaggerv object was written to the output folder and the output file didn't mention such a process either. Only reading threads have been written. The outer loop took 2300 sec and the output file does not specify what was performed within that outer loop. The vdaggerv check took 85 sec.

kostrzewa commented 5 years ago

No vdaggerv object was written to the output folder and the output file didn't mention such a process either.

Did you read the source code? The "operator" files are written at the very end.

Only reading threads have been written.

I don't understand what this means.

The outer loop took 2300 sec and

How many of the phases ran in those 2300 seconds then?

the output file does not specify what was performed within that outer loop.

feel free to add some verbosity if you'd like (suitably wrapped with if (gd.verbose == 1))

The vdaggerv check took 85 sec.

That matches what I saw as well.

matfischer commented 5 years ago

How many of the phases ran in those 2300 seconds then?

One phase took 2300 sec...

After 3/6 Phases the program aborted due to the wall time limit.

kostrzewa commented 5 years ago

One phase took 2300 sec...

that seems a bit long (from my numbers I would expect around 1800 seconds with 16 reading threads; see further below).

After 3/6 Phases the program aborted due to the wall time limit.

You should run in the batch partition then rather than in the devel one, to at least have a single job run from start to finish. (not necessary for testing, of course)

Please take a look at

/p/home/jusers/kostrzewa2/juwels/build/juwels/stage-2019a/skylake-avx512/intel_mpi_2019-intel_2019/slaph-contractions

to see how I compile the code. Note also the load_modules.sh file, which you can source ($ source load_modules.sh) before running $ ./do_cmake.sh.

See /p/scratch/chbn28/hbn288/contractions/nf2/cA2a.09.48 for my test jobs, in particular /p/scratch/chbn28/hbn288/contractions/nf2/cA2a.09.48/jobs/nb_evec_read_threads8-nb_vdaggerv_eigen_threads32/outputs where one phase (for 8 time slices) takes ~900 seconds.

matfischer commented 5 years ago

See /p/scratch/chbn28/hbn288/contractions/nf2/cA2a.09.48 for my test jobs, in particular /p/scratch/chbn28/hbn288/contractions/nf2/cA2a.09.48/jobs/nb_evec_read_threads8-nb_vdaggerv_eigen_threads32/outputs where one phase (for 8 time slices) takes ~900 seconds.

Is there a way that I can launch your test with 16 reading threads and 32 eigen threads for comparison? As far as I can see, you haven't performed this test yet.

kostrzewa commented 5 years ago

Is there a way that I can launch your test with 16 reading threads and 32 eigen threads for comparison? As far as I can see, you haven't performed this test yet.

copy the job script generator and run it?

matfischer commented 5 years ago

Is there a way that I can launch your test with 16 reading threads and 32 eigen threads for comparison? As far as I can see, you haven't performed this test yet.

copy the job script generator and run it?

No, I mean: how can I use your executable? I cannot execute your script, and I cannot access your home folder where I could see how you compiled it.

kostrzewa commented 5 years ago

No, I mean: how can I use your executable? I cannot execute your script, and I cannot access your home folder where I could see how you compiled it.

I see. I only have symlinks in my home directory and I had forgotten that the permissions had changed so drastically. Look here instead:

/p/project/chbn28/kostrzewa2/build/juwels/stage-2019a/skylake-avx512/intel_mpi_2019-intel_2019/slaph-contractions

which is the path that I link to from my home directory (which you can't know, of course).

kostrzewa commented 5 years ago

Note that there will not be much of a difference between different numbers of reading threads above 4, say, as VdaggerV is strongly dominated by the linear algebra. I would thus expect each phase to take 1800 seconds give or take 100 or so when the number of time slices per phase is 16.

matfischer commented 5 years ago

Okay, I think I've fixed the behaviour with N reading threads and a single eigen thread to be correct (this was not covered by the tests, so it did not show up). @matfischer can you please confirm that the "old" way of running things now works again (nb_vdaggerv_eigen_threads=1, nb_evec_read_threads=8) for the 32c64 lattice on qbig? Note that it will be a little slower than it was before because the VdaggerV check is included by default now, but this is a good thing.

The "old" way works on qbig/lnode07. Tested nb_evec_read_threads=8 and nb_vdaggerv_eigen_threads=1; the vdaggerv part takes 556 sec. When I set nb_evec_read_threads=1 and nb_vdaggerv_eigen_threads=8, it takes 1000 sec. However, memory consumption is smaller with the latter approach (not unexpected), which might be beneficial for large lattices.

kostrzewa commented 5 years ago

The "old" way works on qbig/lnode07. Tested nb_evec_read_threads=8 and nb_vdaggerv_eigen_threads=1; the vdaggerv part takes 556 sec. When I set nb_evec_read_threads=1 and nb_vdaggerv_eigen_threads=8, it takes 1000 sec. However, memory consumption is smaller with the latter approach (not unexpected), which might be beneficial for large lattices.

In this, did you comment out the VdaggerV check? If yes, how does this compare to the original implementation (which did not do the check)?

matfischer commented 5 years ago

In this, did you comment out the VdaggerV check? If yes, how does this compare to the original implementation (which did not do the check)?

No, I was primarily interested in a timing run for the code with the check. Unfortunately I overwrote the old output file on qbig, as back then we were just interested in whether we were using all cores. I only have timing data for the old code on juwels. Shall I perform timing runs for the old (current) master branch on lnode07?

Besides that, I was interested in the memory usage. We use around 4 GB for nb_evec_read_threads=1 and nb_vdaggerv_eigen_threads=8 on cA2.60.32. The "old" way with the new code costs ~10 GB.

kostrzewa commented 5 years ago

No, I was primarily interested in a timing run for the code with the check. Unfortunately I overwrote the old output file on qbig, as back then we were just interested in whether we were using all cores. I only have timing data for the old code on juwels. Shall I perform timing runs for the old (current) master branch on lnode07?

That would be great, yes.

For the code in the present branch, it would be good if you could do another timing where you comment out the VdaggerV correctness check, such that the "old" and "new" codes are exactly comparable in terms of time for VdaggerV production.

matfischer commented 5 years ago

For the code in the present branch, it would be good if you could do another timing where you comment out the VdaggerV correctness check, such that the "old" and "new" codes are exactly comparable in terms of time for VdaggerV production.

So for the "new" code I performed timings last night without the VdaggerV check. Time with check for read_threads=8 and eigen_threads=1: 555 sec; without: 523 sec; memory: ~10 GB.

Time with check for read_threads=1 and eigen_threads=8: 1025 sec; without check: 984 sec; memory: ~4 GB.

That means the check costs ~5% wall time.

Then I will compare the current master branch to the nb_evec_read_threads run without the check. After that I will start timings for cA2.09.48 on juwels with the "new" code (varying eigen/read threads) and check whether the ~5% wall-time cost of the VdaggerV check is confirmed on juwels as well. Any objections concerning my plan?

kostrzewa commented 5 years ago

That means the check costs ~5% wall time.

Okay, this is more or less as expected then, thanks.

So for the "new" code I performed timings last night without the VdaggerV check. Time with check for read_threads=8 and eigen_threads=1: 555 sec; without: 523 sec; memory: ~10 GB.

As for the total time, it seems that we have lost quite a bit compared to the numbers reported by @pittlerf above, but as I said above, I believe that @pittlerf was not running on lnode07 when performing the test.

Then I will compare the current master branch to the nb_evec_read_threads run without the check.

Yes, that's the logical next step. When I performed this check a few days ago, I obtained (with the master branch):

[   Finish] Eigenvector and Gauge I/O. Duration: 492.8026589742 seconds. Threads: 8

and it would be great if you could confirm this. Given that you measured 523 seconds, that would be a completely acceptable loss of performance compared to the gained flexibility.

Before then moving on to 48c96, could you please also recompile the new code with EIGEN_DONT_PARALLELIZE commented back into CMakeLists.txt and run with nb_evec_read_threads=8, nb_vdaggerv_eigen_threads=1 (make sure to delete CMakeCache.txt and CMakeFiles in the build directory)? I would like to understand where exactly the ~30 seconds are lost, or whether we're simply seeing a performance fluctuation (completely expected at this order of magnitude).

Second to last: it is very important that we also have a comparison of the time for the contractions themselves between the old and the new code, due to the removal of EIGEN_DONT_PARALLELIZE.

Finally: even more important than performance is correctness, and so it's important to compare the final correlation functions in the three cases:

  1. old code running with 8 threads
  2. new code running with nb_evec_read_threads=8, nb_vdaggerv_eigen_threads=8
  3. new code running with nb_evec_read_threads=8, nb_vdaggerv_eigen_threads=1

as (2) and (3) are not covered by the integration tests. This should complete our understanding of what we have done here.

kostrzewa commented 5 years ago

Finally: even more important than performance is correctness, and so it's important to compare the final correlation functions in the three cases:

Note that when I say this, I don't imply any analysis. A simple side-by-side diff of the h5dump output will be more than sufficient to make sure that things still work as expected.