OrderN / CONQUEST-release

Full public release of the large-scale and linear-scaling DFT code CONQUEST
http://www.order-n.org/
MIT License

Thread loops over blocks #195

Closed: tkoskela closed this 8 months ago

tkoskela commented 11 months ago

PR to close #178

This PR adds threading over the biggest loops over blocks in calc_matrix_elements_module and PAO_grid_transform_module.

In calc_matrix_elements_module, threading is added to the subroutines get_matrix_elements_new and act_on_vectors_new. Both contain a deep loop nest whose outermost loop runs over blocks. In get_matrix_elements_new the innermost loop accumulates data into send_array, so a reduction is performed at the end of the parallel region. In act_on_vectors_new, gridfunctions%griddata is updated by a gemm call in the innermost loop; each iteration writes to a distinct set of indices, so all threads can operate on it in parallel.
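
The two patterns look roughly like the following standalone sketch. This is illustrative only, with made-up names, sizes, and arithmetic rather than the actual CONQUEST routines:

```fortran
program block_threading_patterns
  implicit none
  integer, parameter :: n_blocks = 64, n_per_block = 100
  real(8) :: send_array(n_per_block)          ! every block accumulates here -> needs a reduction
  real(8) :: griddata(n_blocks*n_per_block)   ! each block writes its own slice -> independent updates
  integer :: iblock, i, offset

  send_array = 0.0d0
  griddata   = 0.0d0

  ! Pattern 1 (as in get_matrix_elements_new): all blocks sum into the same
  ! array, so the shared result is formed with an OpenMP reduction.
  !$omp parallel do default(shared) private(iblock, i) reduction(+:send_array)
  do iblock = 1, n_blocks
     do i = 1, n_per_block
        send_array(i) = send_array(i) + real(iblock, 8)
     end do
  end do
  !$omp end parallel do

  ! Pattern 2 (as in act_on_vectors_new): each block updates a disjoint slice
  ! of griddata, so the iterations are independent and need no reduction.
  !$omp parallel do default(shared) private(iblock, i, offset)
  do iblock = 1, n_blocks
     offset = (iblock - 1)*n_per_block
     do i = 1, n_per_block
        griddata(offset + i) = real(iblock, 8)
     end do
  end do
  !$omp end parallel do

  print *, 'send_array(1) =', send_array(1), 'griddata(1) =', griddata(1)
end program block_threading_patterns
```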

In PAO_grid_transform_module, the single_PAO_to_grid subroutine is rewritten so that the loop over blocks can be threaded. Its innermost loop updates gridfunctions%griddata, but the index was previously accumulated sequentially inside the loop, which made threading unsafe. The rewritten subroutine first precomputes the indices and stores them in an array, and then runs the loop over blocks, which can now be threaded safely. The single_PAO_to_grad subroutine has not been refactored; it will be removed in a separate PR and its functionality merged into single_PAO_to_grid to address #244.
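
The refactoring pattern is sketched below, again with illustrative names and sizes rather than the real CONQUEST data structures: a serial pre-pass computes the running offsets that used to be accumulated inside the block loop, after which the block loop itself has no loop-carried dependence and can be threaded.

```fortran
program precompute_indices
  implicit none
  integer, parameter :: n_blocks = 32
  integer :: block_size(n_blocks)       ! points per block (varies in practice)
  integer :: block_offset(n_blocks)     ! precomputed start index for each block
  real(8), allocatable :: griddata(:)
  integer :: iblock, i, n_total

  block_size = [(10 + mod(iblock, 3), iblock = 1, n_blocks)]

  ! Serial pre-pass: compute the offsets up front instead of accumulating
  ! them inside the block loop, removing the sequential dependence.
  n_total = 0
  do iblock = 1, n_blocks
     block_offset(iblock) = n_total
     n_total = n_total + block_size(iblock)
  end do
  allocate(griddata(n_total))
  griddata = 0.0d0

  ! With the offsets known, every block writes to its own slice of griddata
  ! and the loop over blocks can safely run in parallel.
  !$omp parallel do default(shared) private(iblock, i)
  do iblock = 1, n_blocks
     do i = 1, block_size(iblock)
        griddata(block_offset(iblock) + i) = real(iblock, 8)
     end do
  end do
  !$omp end parallel do

  print *, 'total grid points:', n_total, 'last value:', griddata(n_total)
end program precompute_indices
```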

tkoskela commented 11 months ago

Added threading to one of the loops in calc_matrix_elements. Did not check performance yet.

Tests pass. However, I tested with send_array as a shared variable, which should create a data race, and I had to go to 9 decimal places before the testsuite picked up a difference from the reduction implementation. Against the reference data currently in the tests, my tests fail at 7 decimal places (with no changes to develop). That is a bit concerning: we don't have a test that is accurate enough to pick up the race condition.

tkoskela commented 11 months ago

I just realized the values in the output are only written to 8 decimal places, so my race condition is not producing any observable error in the testsuite. Very strange :thinking:

davidbowler commented 11 months ago

I have tested the attached file (216 atoms, which scales happily enough to 9 MPI processes) on combinations of 1 to 8 MPI processes with 1 to 4 OpenMP threads and found the same energy to 10 decimal places in all cases. I'm not sure if this helps!

LiCltest.tgz

tkoskela commented 11 months ago

It's always awkward when the code is producing correct results and you don't understand why :grin:

I wrote this bit of code to test the concept, and it does indeed produce a race condition when I don't declare array_to_reduce as a reduction variable. I don't quite understand what is different in calc_matrix_elements, but I'll keep digging.
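
For reference, the reproducer is along these lines (a reconstruction rather than the attached test; everything apart from the array_to_reduce name is illustrative). With the reduction clause the result is deterministic; deleting it leaves the shared array racy, and the totals typically fall short of the expected value:

```fortran
program race_demo
  implicit none
  integer, parameter :: n = 8, n_iter = 100000
  real(8) :: array_to_reduce(n)
  integer :: it, i

  array_to_reduce = 0.0d0

  ! Correct version: reduction(+:array_to_reduce).
  ! To reproduce the race, remove the reduction clause so the array is shared.
  !$omp parallel do default(shared) private(it, i) reduction(+:array_to_reduce)
  do it = 1, n_iter
     do i = 1, n
        array_to_reduce(i) = array_to_reduce(i) + 1.0d0
     end do
  end do
  !$omp end parallel do

  ! Expect n_iter in every element; a racy run usually does not reach it.
  print *, array_to_reduce
end program race_demo
```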

tkoskela commented 8 months ago

I think this PR is now ready to merge into develop. We can address #244 in a separate PR to avoid cluttering this one further. Here is a performance comparison against develop. I ran these on young using 8 MPI ranks and the attached input files (input_data.zip).

|                              | develop  | This PR  |
|------------------------------|----------|----------|
| Without -fopenmp             | 95.331 s | 63.482 s |
| With -fopenmp, 1 OMP thread  | 95.597 s | 62.324 s |
| With -fopenmp, 4 OMP threads |          | 36.491 s |

Even the serial performance is significantly improved! I've started work on the code for #244 and it looks like we can gain a bit more speedup in threaded performance.

davidbowler commented 8 months ago

Running the LiCl test (input files above) on a local cluster (AMD, Intel compiler 2020_u4, MKL, OpenMPI 4.1.3), I get a segfault when using two threads and 9 MPI processes; the fault seems to come from a reduction. Another segfault occurs with GCC 11.2.0 on the same system. On a MacBook Pro M2 with GCC 12.2 and OpenMPI 4.1.5 the same test runs without problems.

davidbowler commented 8 months ago

I find the same problem with your test above.

tkoskela commented 8 months ago

Interesting. Is it possible to get access to the cluster?

On young I'm using:

gcc and openmpi are available there as well; I can see if I can build with them.

davidbowler commented 8 months ago

The problem comes from calc_matrix_elements_new (found by trial and error, commenting out different areas), but I don't know more yet.

davidbowler commented 8 months ago

We are now confident that the problem here is coming from the OpenMP stack (the errors I found were caused by a stack overflow...). Setting the environment variable OMP_STACKSIZE fixes the issue; in particular, I found that a value somewhere around 50M was helpful, though this might require a little testing.
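
For example, in a bash-style job script (an assumption on my part; adapt to your shell and scheduler) this would be `export OMP_STACKSIZE=50M` set before launching the run.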

I will add to the documentation before approving the PR.