HISKP-LQCD / sLapH-contractions

Stochastic LapH contraction program
GNU General Public License v3.0

support the use of different numbers of threads for the parallel reading of eigensystems and the dense matrix multiply #98

Closed kostrzewa closed 5 years ago

kostrzewa commented 5 years ago

…ing of eigensystems and the dense matrix multiply

@matfischer @pittlerf can you please test this?

pittlerf commented 5 years ago

…ing of eigensystems and the dense matrix multiply

@matfischer @pittlerf can you please test this?

Ok, I will test it now.

kostrzewa commented 5 years ago

@matfischer @pittlerf note that you have write privileges to this repo, please don't write into the master branch!

kostrzewa commented 5 years ago

I'm sure this will take some massaging until it works, but I guess the idea is apparent, yes? Please make sure that I didn't make any logic mistakes in the strip mining of the time slices.
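
A minimal illustrative sketch of the strip-mining idea (hypothetical names and signatures, not the actual PR code): the time extent is processed in strips of nb_read_threads slices, each strip is read in parallel by the reading threads, and the dense multiply for each slice then relies on Eigen's internal threading.

#include <algorithm>
#include <vector>
#include <Eigen/Dense>
#include <omp.h>

// Sketch only: build V^dagger * V per time slice, strip-mined over the time extent.
void build_vdaggerv_strips(std::vector<Eigen::MatrixXcd> &vdaggerv,
                           std::vector<Eigen::MatrixXcd> &V,  // eigenvectors per slice
                           int Lt, int nb_read_threads, int nb_eigen_threads) {
  Eigen::setNbThreads(nb_eigen_threads);
  for (int t0 = 0; t0 < Lt; t0 += nb_read_threads) {
    int const t1 = std::min(t0 + nb_read_threads, Lt);
    // (1) one reading thread per slice of the strip (eigenvector I/O)
#pragma omp parallel for num_threads(nb_read_threads)
    for (int t = t0; t < t1; ++t) {
      // read_eigenvectors(V[t], t);  // hypothetical reader
    }
    // (2) dense multiplies for the strip, each internally threaded by Eigen
    for (int t = t0; t < t1; ++t) {
      vdaggerv[t] = V[t].adjoint() * V[t];
    }
  }
}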

pittlerf commented 5 years ago

I'm sure this will take some massaging until it works, but I guess the idea is apparent, yes? Please make sure that I didn't make any logic mistakes in the strip mining of the time slices.

Thanks, I am going through it carefully.

kostrzewa commented 5 years ago

Just to make sure: you're following what TravisCI finds, right?

kostrzewa commented 5 years ago

Alright, at least with default params it passes the CI.

kostrzewa commented 5 years ago

Your turn now: test on lnode07 with the 32c64 lattice how this compares against running without Eigen threads (with the same number of reading threads). And finally, check the scaling for the 48c96 with 2 reading threads while increasing nb_vdaggerv_eigen_threads.

kostrzewa commented 5 years ago

Is it clear what should be tested from the new parameters in global_data?

kostrzewa commented 5 years ago

The aim should certainly be to create either

1) a separate VdaggerV executable to replace the old one, such that one doesn't have to add hacks to the contract executable to avoid reading perambulators and excessive operator pre-computations

2) a clear flag (or sets of flags) for contract to put it into "pure VdaggerV" mode

kostrzewa commented 5 years ago

Have you gotten in touch with the Eigen developers? We still need to know if nested parallelism works in principle because it might help also for the actual contractions. I'm not at all sure that the contractions will scale reasonably to 48 threads on Juwels. (Have you tested this?)
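
For reference, a self-contained toy (editor's sketch, not project code) to probe whether Eigen's product kernels still run multi-threaded when called from inside an outer OpenMP region, i.e. the nested-parallelism question above:

#include <cstdio>
#include <Eigen/Dense>
#include <omp.h>

int main() {
  omp_set_max_active_levels(2);  // allow nested parallel regions
  Eigen::setNbThreads(4);        // threads Eigen may use per product
  std::printf("Eigen reports %d threads\n", Eigen::nbThreads());

  Eigen::MatrixXd A = Eigen::MatrixXd::Random(2000, 2000);
#pragma omp parallel num_threads(2)
  {
    // each outer thread triggers a large GEMM; timing this (or watching core
    // utilisation) shows whether Eigen spawns its own team inside the region
    Eigen::MatrixXd B = A.adjoint() * A;
    std::printf("outer thread %d done, B(0,0) = %f\n",
                omp_get_thread_num(), B(0, 0));
  }
  return 0;
}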

pittlerf commented 5 years ago

Have you gotten in touch with the Eigen developers? We still need to know if nested parallelism works in principle because it might help also for the actual contractions. I'm not at all sure that the contractions will scale reasonably to 48 threads on Juwels. (Have you tested this?)

Hi, I have tested the VdaggerV computation with different numbers of Eigen threads on Juwels. (See for example /p/scratch/chbn28/hbn28d/contractions/cA2.60.32/readthread8_eigenthread48 and /p/scratch/chbn28/hbn28d/contractions/cA2.60.32/readthread8_eigenthread1_old)

For these settings the old one actually seems to be faster: old ~6 min, new with 48 Eigen threads ~10 min.

kostrzewa commented 5 years ago

Which commit did you test? For a while the Eigen threading was actually disabled

kostrzewa commented 5 years ago

Note that the outputs that you provide there have failed assertions or have been aborted.

pittlerf commented 5 years ago

Have you gotten in touch with the Eigen developers? We still need to know if nested parallelism works in principle because it might help also for the actual contractions. I'm not at all sure that the contractions will scale reasonably to 48 threads on Juwels. (Have you tested this?)

I think we have a similar issue here: the memory actually limits the number of total-momentum combinations that can be performed at the same time. For the small lattice we used 32 OMP threads and considered 2 total momenta at a time. Matthias performed the contractions; the time spent was 627912 sec.

kostrzewa commented 5 years ago

Which commit did you test? For a while the Eigen threading was actually disabled

Looking at the output in /p/scratch/chbn28/hbn28d/contractions/cA2.60.32/readthread8_eigenthread6/outputs, you were testing commit f8f86401d1a6bbb3aedb1499c61ba89079a34510, which still had Eigen threading disabled in CMakeLists.txt, see https://github.com/HISKP-LQCD/sLapH-contractions/commits/nb_evec_read_threads . Looking at some of your other directories there, you were also testing some older commits. I think you'll need to repeat the tests when Juwels is back.

Scans to perform (both for 32c64 and 48c96):

nb_read_evec_threads \in [1, 2, 4, 8, 12, 24]
nb_vdaggerv_eigen_threads \in [1, 2, 4, 6, 8, 12, 16, 24, 32, 48]

Note that for each combination you should of course use a different configuration number to avoid file system caching effects!

Just to make it clear: the thread numbers are now fully independent, no multiplication of threads

kostrzewa commented 5 years ago

@matfischer did you test the threading on lnode07 yet?

pittlerf commented 5 years ago

Which commit did you test? For a while the Eigen threading was actually disabled

Looking at the output in /p/scratch/chbn28/hbn28d/contractions/cA2.60.32/readthread8_eigenthread6/outputs, you were testing commit f8f8640, which still had Eigen threading disabled in CMakeLists.txt, see https://github.com/HISKP-LQCD/sLapH-contractions/commits/nb_evec_read_threads

Hi, yes, I know that; the ones with 48 and 32 threads are compiled without EIGEN_DO_NOT_PARALLELIZE.

kostrzewa commented 5 years ago

Note also that in commits before those from last night, the VdaggerV trace and sum checks were performed very inefficiently.

pittlerf commented 5 years ago

Note also that in commits before those from last night, the VdaggerV trace and sum checks were performed very inefficiently.

I now set up a test on qbig for both cA2.60 and cA2.09

kostrzewa commented 5 years ago

I now set up a test on qbig for both cA2.60 and cA2.09

Just make sure that for each combination you use a new gauge configuration ID, otherwise caching effects (on the file system side) completely skew the measurements.

kostrzewa commented 5 years ago

Also note that it might well be that we actually lose a little performance on the smaller lattices, hopefully gaining quite a bit on the large ones where it really matters. (i.e., if VdaggerV takes 10 minutes on the 32c64, it will take about ten times longer on the 48c96 with good scaling and much more than 10 times longer if we can't use all cores)

kostrzewa commented 5 years ago

Note also that on Juwels it might make sense to bind to just a single socket (and use just 24 cores out of the 48 total). Perhaps there's even enough memory to run two copies of the VdaggerV code in parallel, each bound to one socket.

matfischer commented 5 years ago

@matfischer did you test the threading on lnode07 yet?

Unfortunately not yet, but I'll try it in the meantime. I have to figure out how to merge the latest changes into my fork.

kostrzewa commented 5 years ago

Unfortunately not yet, but I'll try it in the meantime. I have to figure out how to merge the latest changes into my fork.

Note that any hacks to prevent the reading of perambulators or the calculation of correlation functions should be removed anyway, as per https://github.com/HISKP-LQCD/sLapH-contractions/pull/98#issuecomment-509559809

matfischer commented 5 years ago

lnode07

In the meantime I found time to test the code. I didn't modify anything, and what I can tell from the screenshot is that all cores on lnode07 do what we want them to do. The test is from the nb_evec_read_threads branch.

matfischer commented 5 years ago

Unfortunately not yet, but I'll try it in the meantime. I have to figure out how to merge the latest changes into my fork.

Note that any hacks to prevent the reading of perambulators or the calculation of correlation functions should be removed anyway, as per #98 (comment)

On qbig everything was fine. I just have to figure out what I modified on Juwels. Unfortunately there is maintenance on Juwels, so I cannot test it there. I hope that the contractions of the moving frames P=2,3 have finished by the early morning, as I launched them overnight. P=0,1 took around 6-20 min to compute.

pittlerf commented 5 years ago

Hi, I have test results in /hiskp4/pittler/contractions/cA2.60.32. The VdaggerV object generation with 8 threads using the old code takes 302.0805448527 seconds (directory /hiskp4/pittler/contractions/cA2.60.32/read8_eigenthread1_old), while the new parallel section takes 654.7547035106 seconds (/hiskp4/pittler/contractions/cA2.60.32/read8_eigenthread8/outputs). I think the reason could be the barriers between the parallel sections.

kostrzewa commented 5 years ago

The VdaggerV object generation with 8 threads: 302 seconds (old) vs 650 seconds (new)

Good, can you repeat this for the 48c96 please (both versions will fit into memory with 8 reading threads)?

This indicates to me that it will be necessary to generalise this a bit more in order to support also the "old" way of doing things. It should be simple enough to implement:

1) move the actual VdaggerV calls into a separate kernel function
2) when nb_vdaggerv_eigen_threads == 1, the old way of doing reading -> vdaggerv -> writing on a per-thread basis is used
3) in all other cases, the new implementation is used
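
An illustrative sketch of this dispatch (placeholder names and signatures, not the actual sLapH-contractions interfaces): the VdaggerV computation is factored into a kernel, and nb_vdaggerv_eigen_threads == 1 selects the old per-time-slice threading, while any other value selects the new Eigen-threaded path.

#include <vector>
#include <Eigen/Dense>
#include <omp.h>

// (1) the actual VdaggerV computation, factored into a kernel function
inline Eigen::MatrixXcd vdaggerv_kernel(Eigen::MatrixXcd const &V) {
  return V.adjoint() * V;
}

void compute_vdaggerv(std::vector<Eigen::MatrixXcd> &vdaggerv,
                      std::vector<Eigen::MatrixXcd> const &V, int Lt,
                      int nb_read_threads, int nb_vdaggerv_eigen_threads) {
  if (nb_vdaggerv_eigen_threads == 1) {
    // (2) old way: reading -> vdaggerv -> writing on a per-thread basis,
    //     Eigen itself single-threaded
    Eigen::setNbThreads(1);
#pragma omp parallel for num_threads(nb_read_threads)
    for (int t = 0; t < Lt; ++t)
      vdaggerv[t] = vdaggerv_kernel(V[t]);
  } else {
    // (3) new way: serial loop over t, each product threaded inside Eigen
    Eigen::setNbThreads(nb_vdaggerv_eigen_threads);
    for (int t = 0; t < Lt; ++t)
      vdaggerv[t] = vdaggerv_kernel(V[t]);
  }
}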

I think, the reason can be the barriers between the parallel sections.

No, these rather infrequent barriers would not cause this kind of regression. You are right, however, that the loss of performance is due to overheads, mainly barriers and other associated synchronisation costs, but internal to Eigen. Doing a thread-parallel V[t].adjoint() * V[t] is of course less efficient than doing 8 of them naively parallelised, as there is plenty of accumulation which requires atomics. This is doubly true on a dual-socket system due to the associated cross-socket communication. I was just hoping that the impact would not be quite this high...

kostrzewa commented 5 years ago

Oh, btw, what kind of optimisations did you compile with? That's kind of important for threading performance.

pittlerf commented 5 years ago

Oh, btw, what kind of optimisations did you compile with? That's kind of important for threading performance.

Qbig:

cmake \
  -DCMAKE_C_COMPILER=cc \
  -DCMAKE_CXX_COMPILER=c++ \
  -DCMAKE_CXX_FLAGS_RELEASE='-fopenmp -O3 -mtune=sandybridge -march=sandybridge -g' \
  -DLIME_INCLUDE_DIRS=/qbigwork2/bartek/libs/lime/sandybridge/include \
  -DLIME_LIBRARIES='-L/qbigwork2/bartek/libs/lime/sandybridge/lib -llime' \
  /qbigwork2/pittler/code/sLapH-contractions_bartek

kostrzewa commented 5 years ago

-DCMAKE_CXX_FLAGS_RELEASE='-fopenmp -O3 -mtune=sandybridge -march=sandybridge -g'

and you're sure that these were actually used? In your build directory, you don't seem to specify a build type. You can check by doing make clean && VERBOSE=1 make

pittlerf commented 5 years ago

-DCMAKE_CXX_FLAGS_RELEASE='-fopenmp -O3 -mtune=sandybridge -march=sandybridge -g'

and you're sure that these were actually used? In your build directory, you don't seem to specify a build type. You can check by doing make clean && VERBOSE=1 make

I just downloaded the new branch around lunch time, and I compiled it with ./do_cmake

kostrzewa commented 5 years ago

I just downloaded the new branch around lunch time, and I compiled it with ./do_cmake

Sure, that's clear enough. But since you're passing the CXX flags for the RELEASE build type, I'm not sure that these will actually be used when the build type is not defined.

kostrzewa commented 5 years ago

Sure, that's clear enough. But since you're passing the CXX flags for the RELEASE build type, I'm not sure that these will actually be used when the build type is not defined.

I just checked and it seems that they are passed, we're good then.

pittlerf commented 5 years ago

I just downloaded the new branch around lunch time, and I compiled it with ./do_cmake

Sure, that's clear enough. But since you're passing the CXX flags for the RELEASE build type, I'm not sure that these will actually be used when the build type is not defined.

In the meantime I'll recompile.

kostrzewa commented 5 years ago

Executing cmake in a new directory, as you did:

$ ./do_cmake
$ VERBOSE=1 make
[...]
/opt/gcc-5.4.0/bin/c++   -I/qbigwork2/pittler/code/sLapH-contractions_bartek/include -I/qbigwork2/bartek/build/temp/sLapH-contractions/include -I/usr/include/hdf5/serial -isystem /usr/include/eigen3 -isystem /qbigwork2/bartek/libs/lime/sandybridge/include  -fopenmp -O3 -mtune=sandybridge -march=sandybridge -g -O2   -Wall -pedantic -fdiagnostics-color=auto -fopenmp -std=gnu++11 -o CMakeFiles/lcontract.dir/src/ComplexProduct.cpp.o -c /qbigwork2/pittler/code/sLapH-contractions_bartek/src/ComplexProduct.cpp

so all is good.

pittlerf commented 5 years ago

Executing cmake in a new directory, as you did:

$ ./do_cmake
$ VERBOSE=1 make
[...]
/opt/gcc-5.4.0/bin/c++   -I/qbigwork2/pittler/code/sLapH-contractions_bartek/include -I/qbigwork2/bartek/build/temp/sLapH-contractions/include -I/usr/include/hdf5/serial -isystem /usr/include/eigen3 -isystem /qbigwork2/bartek/libs/lime/sandybridge/include  -fopenmp -O3 -mtune=sandybridge -march=sandybridge -g -O2   -Wall -pedantic -fdiagnostics-color=auto -fopenmp -std=gnu++11 -o CMakeFiles/lcontract.dir/src/ComplexProduct.cpp.o -c /qbigwork2/pittler/code/sLapH-contractions_bartek/src/ComplexProduct.cpp

so all is good.

The output file for both reading and Eigen threads equal to 8 gives the following: "Eigenvector and Gauge I/O, VdaggerV construction. Duration: 654.7547035106 seconds. Threads: 1". Should the number of threads here not be 8 as well?

pittlerf commented 5 years ago

Executing cmake in a new directory, as you did:

$ ./do_cmake
$ VERBOSE=1 make
[...]
/opt/gcc-5.4.0/bin/c++   -I/qbigwork2/pittler/code/sLapH-contractions_bartek/include -I/qbigwork2/bartek/build/temp/sLapH-contractions/include -I/usr/include/hdf5/serial -isystem /usr/include/eigen3 -isystem /qbigwork2/bartek/libs/lime/sandybridge/include  -fopenmp -O3 -mtune=sandybridge -march=sandybridge -g -O2   -Wall -pedantic -fdiagnostics-color=auto -fopenmp -std=gnu++11 -o CMakeFiles/lcontract.dir/src/ComplexProduct.cpp.o -c /qbigwork2/pittler/code/sLapH-contractions_bartek/src/ComplexProduct.cpp

so all is good.

The output file for both reading and Eigen threads equal to 8 gives the following: "Eigenvector and Gauge I/O, VdaggerV construction. Duration: 654.7547035106 seconds. Threads: 1". Should the number of threads here not be 8 as well?

Or is it that swatch.Stop is now not in the OMP parallel region, and that's why it gives 1?

kostrzewa commented 5 years ago

Should the number of threads here not be 8 as well?

No, as I noted in the source code (to be fair, I accidentally left the comment unfinished: I reworded it and then forgot to finish the sentence), the output of the stopwatch will be misleading for this. The stopwatch is started outside of a parallel section, so it uses a single thread (see the StopWatch source code).

The StopWatch should also be extended a little to be more flexible when it comes to single- and multi-threaded output.
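
As a sketch of such an extension (illustrative only; the project's actual StopWatch interface may differ), the number of active threads could be captured when the watch is started:

#include <cstdio>
#include <omp.h>

// Illustrative thread-aware stopwatch, not the project's StopWatch class.
class ThreadedStopWatch {
  double start_ = 0.0;
  double seconds_ = 0.0;
  int threads_ = 1;

 public:
  void start() {
    threads_ = omp_get_num_threads();  // reports 1 outside a parallel region
    start_ = omp_get_wtime();
  }
  void stop() { seconds_ = omp_get_wtime() - start_; }
  void print(char const *what) const {
    std::printf("%s. Duration: %.10f seconds. Threads: %d\n",
                what, seconds_, threads_);
  }
};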

pittlerf commented 5 years ago

The StopWatch should also be extended a little to be more flexible when it comes to single- and multi-threaded output.

One more result: old code with 2 threads: 1213.3022905490 s; new parallel with 2 reading + 4 Eigen threads: 934.8336688681 s

kostrzewa commented 5 years ago

One more result: old code with 2 threads: 1213.3022905490 s; new parallel with 2 reading + 4 Eigen threads: 934.8336688681 s

For the 32c64 on qbig?

pittlerf commented 5 years ago

One more result: old code with 2 threads: 1213.3022905490 s; new parallel with 2 reading + 4 Eigen threads: 934.8336688681 s

For the 32c64 on qbig?

yes

martin-ueding commented 5 years ago

I have just skimmed this conversation. In this particular project the build type defaults to release because most people seem to be unaware of CMake build types.

See around line 150 in CMakeLists.txt:

# Default to “Release” build type.
message(STATUS "Build Type: '${CMAKE_BUILD_TYPE}'")
if(CMAKE_BUILD_TYPE STREQUAL "")
  message(STATUS "No CMAKE_BUILD_TYPE, assuming release and enabling all optimizations.")
  set(CMAKE_BUILD_TYPE Release)
endif()

So not specifying it is no problem with the contraction code.

kostrzewa commented 5 years ago

So I had some trouble sleeping and worked a little more on this, making sure that there are no glaring issues. Testing on qbig (lnode07), there's some speedup for the 48c96 (running momentum shells 0 and 1 only to save time; 4 reading threads, 8 Eigen threads):

8039 s (old) vs. 7200 s (new)

However, note that for each time slice, 16 seconds are used to do the VdaggerV check. This means that the actual comparison is

8039 s (old) vs. 5660 s (new)

So it seems like the strategy might work for large lattices in memory-constrained environments. Also, the more momenta are added, the less important the time spent in the VdaggerV check becomes.

For 32c64, there's a performance regression by about a factor of 1.6 or so. As a result, it will be necessary to be able to run the "old way of doing things", simply threading over the time slices only. Finally, we need to make sure that the contraction itself is still efficient now that the code will be compiled with Eigen parallelisation enabled.

kostrzewa commented 5 years ago

I have just skimmed this conversation. In this particular project the build type defaults to release because most people seem to be unaware of CMake build types.

Ah, very good, thanks.

kostrzewa commented 5 years ago

@pittlerf note that the jobs that you were running on different kinds of nodes, with different processes sharing the node, make it very difficult to extract meaningful results, which is why I've been using lnode07, lnode06 and lnode09 to do these tests while they are still free of jobs. This way, things are nice and repeatable.

kostrzewa commented 5 years ago

@pittlerf

takes 302.0805448527 seconds (directory /hiskp4/pittler/contractions/cA2.60.32/read8_eigenthread1_old)

I get 500 seconds on lnode07; which node did you test on?

pittlerf commented 5 years ago

@pittlerf

takes 302.0805448527 seconds (directory /hiskp4/pittler/contractions/cA2.60.32/read8_eigenthread1_old)

I get 500 seconds on lnode07; which node did you test on?

I used a job script and actually don't know now which node it was running on.

matfischer commented 5 years ago

Shall I start downloading everything that is needed for the cA2.09.48 ensemble in Jülich and start testing the new code on Juwels?

kostrzewa commented 5 years ago

I used a job script and actually don't know now which node it was running on.

Okay, but then the numbers that you quote are somewhat meaningless, right? First of all, you can't tell what else was running on the node (the node should be exclusive for these kinds of benchmarks), and secondly, you can't even tell if it was running on a CPU that is almost ten years old or one that is just four years old.