OrderN / CONQUEST-release

Full public release of large scale and linear scaling DFT code CONQUEST
http://www.order-n.org/
MIT License

Optimize ompGemm_m multiply kernel #327

Open tkoskela opened 2 months ago

tkoskela commented 2 months ago

This PR optimizes the memory copies in m_kern_min and m_kern_max of multiply_kernel_ompGemm_m. The intention is that this multiply kernel should be favoured in most (if not all) cases, so that the worse-performing multiply kernels can eventually be dropped.

In m_kern_max we avoid temporary copies of matrices A and B entirely by calling dgemm on the whole matrices, zero elements included. This is the main performance gain. The zero elements are skipped when copying the temporary result back to C, so the end result stays sparse. Since dgemm also accepts 1D arrays as arguments, the only remaining work before the dgemm call is computing the start and end indices into A and B. After the dgemm call the result is copied into C from a temporary array.
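
Purely to illustrate the shape of the change (a minimal sketch, not the code in the PR; the program name, array names, block sizes and offsets below are made up), the idea is to point dgemm directly at a section of the flat storage arrays:

```fortran
! Hypothetical sketch: call dgemm on sections of the flat 1D storage arrays,
! using precomputed start indices, instead of packing dense temporaries of A and B.
program dgemm_on_flat_arrays
  implicit none
  integer, parameter :: n1 = 4, n2 = 3, n3 = 5   ! block dimensions (illustrative)
  real(8) :: a(n1*n2), b(n2*n3), ctmp(n1*n3)
  integer :: ia, ib                              ! start indices into the flat A and B
  external :: dgemm

  call random_number(a)
  call random_number(b)
  ia = 1
  ib = 1                                         ! in the kernel these come from the index arrays

  ! ctmp(n1 x n3) = A(n1 x n2) * B(n2 x n3); any zero elements are multiplied too,
  ! which is cheaper than gathering temporary copies of A and B first.
  call dgemm('N', 'N', n1, n3, n2, 1.0d0, a(ia), n1, b(ib), n2, 0.0d0, ctmp, n1)

  ! Only the wanted (non-zero) blocks of ctmp would then be accumulated into the
  ! sparse C storage, so C stays sparse.
  print *, ctmp
end program dgemm_on_flat_arrays
```

Passing an array element such as a(ia) to dgemm's assumed-size dummy argument is ordinary sequence association, which is what lets the packing step be skipped.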

In m_kern_min we still make temporary copies of B and C, because both are stored in a sparse data structure. I've tried vectorizing the copies, but the benefit is small because nd1 and nd3 are typically small. No temporary copy is needed for the result A.
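
For context, a hedged sketch of what such a gather loop looks like (the subroutine, names and index arithmetic below are illustrative, not CONQUEST's actual data layout):

```fortran
! Illustrative only: gather a small dense block of B out of flat sparse storage
! before the dgemm call. With nd2/nd3 typically only a handful, the vectorized
! version of these loops rarely runs more than one or two trips, so the gain is small.
subroutine gather_block(bsrc, offset, nd2, nd3, btmp)
  implicit none
  integer, intent(in)  :: offset, nd2, nd3
  real(8), intent(in)  :: bsrc(*)          ! flat (sparse) storage of B
  real(8), intent(out) :: btmp(nd2, nd3)   ! dense temporary block passed to dgemm
  integer :: j, k
  do k = 1, nd3
     do j = 1, nd2
        btmp(j, k) = bsrc(offset + (k-1)*nd2 + j - 1)   ! offset = 1-based start of the block
     end do
  end do
end subroutine gather_block
```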

The index calculations common to m_kern_max and m_kern_min have been refactored into a separate subroutine, precompute_indices, to reduce code duplication.
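
As a rough illustration of the pattern only (the real precompute_indices in the PR almost certainly has a different interface), the shared index arithmetic is computed once and both kernels consume the result:

```fortran
! Hedged sketch: hoist the shared arithmetic (here a simple running offset over
! the blocks) into one routine instead of repeating it in both kernels.
subroutine precompute_indices(nblocks, block_size, start_idx)
  implicit none
  integer, intent(in)  :: nblocks                ! number of blocks in the flat storage
  integer, intent(in)  :: block_size(nblocks)    ! size of each block
  integer, intent(out) :: start_idx(nblocks)     ! start index of each block
  integer :: i
  start_idx(1) = 1
  do i = 2, nblocks
     start_idx(i) = start_idx(i-1) + block_size(i-1)
  end do
end subroutine precompute_indices
```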

A number of unused variables have also been removed and comments added.

This PR also adds system.myriad.make, since I've found Myriad useful for testing and development.

tkoskela commented 1 month ago

Performance profiling data was collected using benchmarks/matrix_multiply/si_442.xtl. I ran it on Myriad with 8 MPI ranks and 4 OpenMP threads using Intel(R) VTune(TM) Profiler 2022.2.0.

The main improvement to note is the time spent in m_kern_min (103.7s -> 76.9s) and m_kern_max (152.1s -> 85.2s), shown in the "Top Hotspots" table. As an interesting byproduct, the time in kmp_fork_barrier, i.e. the OpenMP load imbalance, has also decreased significantly (560s -> 279.6s). This would make sense if there are not enough m_kern_min and m_kern_max calls to keep all threads busy: the serial calls are now faster, so the idle threads wait for less time. I doubt anything has fundamentally improved in the load balance.

The time in dgemm has also decreased (174s -> 95s). This is a bit surprising, since I would have expected us to be multiplying bigger matrices than previously.

Baseline: Develop branch

[VTune screenshot: Top Hotspots, develop branch]

This PR

[VTune screenshot: Top Hotspots, this PR]

tkoskela commented 1 month ago

On the topic of vectorized memory copies of B and C in m_kern_min, this is what the data in Advisor looks like:

[Intel Advisor screenshot: vectorization report for the copy loops in m_kern_min]

You may note that Advisor sees three loops on lines 411 and 412. I think one is the vectorized loop and one is the remainder (the iterations not divisible by 4). The third looks like a scalar version; perhaps the compiler generates both vector and scalar instructions for the loop, though it's somewhat unclear to me why Advisor would see both. The important bit here is the Gain column, which shows that the gain from vectorization is less than 1. In the Trip Counts column we see that the average trip counts of the vectorized loops are 1 or 2, which explains the low gain. This data was collected using the LargerND input data set I got from @davidbowler.

tkoskela commented 3 weeks ago

Here is data from a run on one thread. These screenshots are a bit busy, but the main comparison is below:

Develop branch

[VTune screenshot: single-thread profile, develop branch]

This PR

[VTune screenshot: single-thread profile, this PR]

tkoskela commented 2 weeks ago

Regarding the meaning of kmp_fork_barrier, I followed the advice in this Stack Overflow thread:

> You need to learn to use VTune better. It has specific OpenMP analyses which avoid you having to ask about the internals of the OpenMP runtime.

I ran the OpenMP threading analysis. In the screenshot below I've zoomed the graph at the bottom in on a typical 2s interval in the middle of the run, where primarily multiply_module, calc_matrix_elements_module and pao_grid_transform_module are being called. I probably won't be able to explain everything here, but let's discuss when we next meet.

My main observation is that the main source of wasted time is "Serial - outside parallel regions", rather than thread imbalance within the parallel regions as I previously thought. It's not clear to me why, in the graph at the bottom, the worker threads are sometimes red and sometimes green outside of parallel regions. Maybe this is an internal optimization in the runtime, which has some logic for deciding whether to keep the threads spinning or make them wait.

Edit: in the graph below, the little blue bridge shapes represent parallel regions. VTune reports that the OpenMP Region Duration Type is either Good or Fast in nearly all of them, which also reassures me that thread imbalance is not our main issue.

[VTune screenshot: OpenMP threading analysis, timeline zoomed to a 2s interval]

tkoskela commented 2 weeks ago

Expanding the "Serial - outside parallel region" line gives a breakdown of where that time is coming from. There are some FFTs in here, if you look closely :mag_right:

[VTune screenshot: breakdown of time spent outside parallel regions]

davidbowler commented 4 days ago

Just for my information, have you got numbers comparing ompGemm_m in v1.3 and in this version? It's not clear from above which kernel was used for the baseline.