tkoskela opened this pull request 2 months ago
Performance profiling data was collected using `benchmarks/matrix_multiply/si_442.xtl`. I ran it on myriad with 8 MPI ranks and 4 OpenMP threads using Intel(R) VTune(TM) Profiler 2022.2.0.
The main improvement to note is the time spent in `m_kern_min` (103.7s -> 76.9s) and `m_kern_max` (152.1s -> 85.2s), shown in the "Top Hotspots" table. As an interesting byproduct, the time in `kmp_fork_barrier`, i.e. the OpenMP load imbalance, has significantly decreased as well (560s -> 279.6s). This would make sense if there are not enough `m_kern_min` and `m_kern_max` calls to keep all threads busy: the serial calls are now faster, so the idle threads wait for less time. I doubt anything has fundamentally improved with the load balance.
The time in `dgemm` has also decreased (174s -> 95s). This is a bit surprising, since I would have expected us to be multiplying bigger matrices than previously.
On the topic of the vectorized memory copies of `B` and `C` in `m_kern_min`, this is what the data in Advisor looks like:
You may note that Advisor sees three loops on lines 411 and 412. I think one is the vectorized loop and one is the remainder (the iterations not divisible by 4). It looks like the third is a scalar version; perhaps the compiler generates both vector and scalar instructions for the loop, though it's somewhat unclear to me why Advisor would see both of them. The important bit to see here is the `Gain` column, which shows the gain from vectorization is less than 1. In the `Trip Counts` column we see that the average trip counts in the vectorized loops are 1 or 2, which explains the low gain. This data was collected using the LargerND input data set I got from @davidbowler.
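For intuition, here is a minimal sketch of the kind of short copy loop Advisor is flagging. This is illustrative only, not the actual CONQUEST code: the names `copy_block`, `src` and `tempb` are assumptions. With `nd1` around 4 and a 4-lane vector width, the vector body executes at most once per outer iteration and the peeled remainder loop does much of the work, which is consistent with a gain near (or below) 1:

```fortran
! Illustrative stand-in for the copy loops around lines 411-412 of
! multiply_kernel_ompGemm_m; names and shapes are assumptions, not the
! real code. The inner loop over nd1 is what the compiler vectorizes.
subroutine copy_block(nd1, nd3, src, tempb)
  implicit none
  integer, intent(in)       :: nd1, nd3
  real(kind=8), intent(in)  :: src(nd1, nd3)
  real(kind=8), intent(out) :: tempb(nd1, nd3)
  integer :: i, j
  do j = 1, nd3
     ! With nd1 ~ 4, the 4-wide vector body runs at most once per j and
     ! the scalar remainder handles the rest, so vectorization buys little.
     do i = 1, nd1
        tempb(i, j) = src(i, j)
     end do
  end do
end subroutine copy_block
```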
Here is data on one thread. These screenshots are a bit busy, but the main points are:

- `dgemm` is unchanged.
- `m_kern_max` is reduced from 105s to 42s.
- `m_kern_min` is reduced from 75s to 69s.

Regarding the meaning of `kmp_fork_barrier`, I followed the advice in this stackoverflow thread.
You need to learn to use VTune better. It has specific OpenMP analyses, which save you from having to ask about the internals of the OpenMP runtime.
I ran the OpenMP threading analysis. In the screenshot below I've zoomed the graph at the bottom in on a typical 2s interval in the middle of the run, where primarily `multiply_module`, `calc_matrix_elements_module` and `pao_grid_transform_module` are being called. I probably won't be able to explain things here, but let's discuss when we next meet.
My main observation here is that the main source of wasted time is `Serial - outside parallel regions`, rather than thread imbalance in the parallel regions as I previously thought. It's not clear to me why, in the graph at the bottom, the worker threads are sometimes red and sometimes green outside of parallel regions. Maybe this is an internal optimization in the runtime, which has some logic for whether to keep the threads spinning or make them wait (in the Intel runtime, this spin-then-sleep behaviour is governed by `KMP_BLOCKTIME` / `OMP_WAIT_POLICY`, which may be what we're seeing).
edit: In the graph below, the little blue bridge shapes represent parallel regions. VTune tells me that the OpenMP Region Duration Type is either `Good` or `Fast` in nearly all of them, which also reassures me that thread imbalance is not our main issue.
Expanding the `Serial - outside parallel region` line, you get a breakdown of where it is coming from. There are some FFTs here, if you look closely :mag_right:
Just for my information, have you got numbers comparing `ompGemm_m` in v1.3 and in this version? It's not clear from the above which kernel was used for the baseline.
This PR optimizes the memory copies in `m_kern_min` and `m_kern_max` of `multiply_kernel_ompGemm_m`. The intention is that this multiply kernel should be favoured in most (if not all) cases, and the worse-performing multiply kernels could be dropped.
In `m_kern_max` we avoid making temporary copies of matrices `A` and `B` altogether, by calling `dgemm` on the whole matrices including the zero elements. This is the main performance gain. The zero elements are skipped when copying the temporary result back to `C`, keeping the end result sparse. `dgemm` also accepts 1D arrays as arguments, so the only remaining work before the `dgemm` call is computing the start and end indices into `A` and `B`. After the `dgemm` call, the result is copied into `C` from a temporary array.
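To make that concrete, here is a hedged sketch of the shape of this logic, assuming a `C += A*B` accumulation. The subroutine name, the offset variables (`offa`, `offb`, `offc`) and the pattern arrays (`row_of`, `col_of`) are illustrative, not the actual CONQUEST identifiers:

```fortran
! Hedged sketch of the m_kern_max strategy described above; names are
! illustrative, not the real code. A, B and C are the 1D storage arrays,
! which dgemm accepts directly, so no packed temporaries of A or B are needed.
subroutine dense_block_multiply(a, b, c, nd1, nd2, nd3, offa, offb, offc, &
                                n_in_pattern, row_of, col_of)
  implicit none
  integer, intent(in) :: nd1, nd2, nd3, offa, offb, offc, n_in_pattern
  integer, intent(in) :: row_of(n_in_pattern), col_of(n_in_pattern)
  real(kind=8), intent(in)    :: a(*), b(*)
  real(kind=8), intent(inout) :: c(*)
  real(kind=8) :: tempc(nd1, nd3)   ! temporary holding the dense result
  integer :: j

  ! One dgemm over the whole nd1 x nd2 and nd2 x nd3 blocks, zeros included.
  call dgemm('N', 'N', nd1, nd3, nd2, 1.0d0, a(offa), nd1, &
             b(offb), nd2, 0.0d0, tempc, nd1)

  ! Copy back only the entries in C's sparsity pattern, keeping C sparse.
  do j = 1, n_in_pattern
     c(offc + j - 1) = c(offc + j - 1) + tempc(row_of(j), col_of(j))
  end do
end subroutine dense_block_multiply
```

The trade-off in this design is a few wasted flops on the zero padding inside `dgemm` in exchange for skipping the pack/unpack of `A` and `B`; sparsity is re-imposed only on the copy-out.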
In `m_kern_min` we still make temporary copies of `B` and `C`, because both of them are stored in a sparse data structure. I've tried vectorizing the copies, but the benefits are not great due to the typically small values of `nd1` and `nd3`. No temporary copy is needed for the result `A`.

The common index calculations of `m_kern_max` and `m_kern_min` have been refactored into a separate subroutine, `precompute_indices`, to reduce code duplication (a sketch of the idea follows below). A number of unused variables have also been removed and comments added.
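I can't infer the exact interface from the description alone, but the kind of shared index computation being factored out is essentially a running sum of block sizes into 1D start offsets. A hypothetical sketch, with assumed names and arguments:

```fortran
! Hypothetical sketch of the shared index pre-computation; the real
! precompute_indices in multiply_kernel_ompGemm_m may look different.
! Given the size of each block in a matrix's 1D storage, compute the
! start offset of every block with a prefix sum.
subroutine precompute_indices(nblocks, block_size, offsets)
  implicit none
  integer, intent(in)  :: nblocks
  integer, intent(in)  :: block_size(nblocks)
  integer, intent(out) :: offsets(nblocks)
  integer :: i
  if (nblocks < 1) return
  offsets(1) = 1
  do i = 2, nblocks
     offsets(i) = offsets(i-1) + block_size(i-1)
  end do
end subroutine precompute_indices
```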
Also adds `system.myriad.make`, since I've found myriad useful for testing and development.