Closed tkoskela closed 7 months ago
I ran a comparison in VTune. In my test, develop is outperforming #266 in total run time. However, there are some interesting differences.
__kmp_fork_barrier. However, it has 50% more time spent in __kmpc_barrier. My initial interpretation is that the serialisation of communication and computation we've forced is causing a lot of time to be wasted at the barriers we've put in. #265 seems like the obvious direction to look at next.
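To make the barrier interpretation above concrete, here is a toy model of idle time at barriers. It is only a sketch: the thread count, the per-phase timings, and the helper names `barrier_wait_times` and `total_idle` are invented for illustration and are not taken from the VTune profile or from the code base.

```python
def barrier_wait_times(work_times):
    """Idle time of each thread at the barrier that ends a phase.

    Every thread has to wait for the slowest one, so a thread's idle
    time is the slowest thread's work time minus its own work time.
    """
    slowest = max(work_times)
    return [slowest - t for t in work_times]


def total_idle(phases):
    """Idle time summed over all threads and all barrier-terminated phases."""
    return sum(sum(barrier_wait_times(phase)) for phase in phases)


# Hypothetical numbers, in arbitrary time units, for 3 threads.
# Serialised: thread 0 communicates for 3 units while the others wait at a
# barrier, then all threads compute for 4 units each.
serialised = [[3, 0, 0], [4, 4, 4]]

# Overlapped (assumed ideal case): thread 0 communicates for 3 units and
# computes for 2, while the other two threads compute for 5 units each.
overlapped = [[5, 5, 5]]

print(total_idle(serialised))  # idle time with a separate communication phase
print(total_idle(overlapped))  # idle time when communication is overlapped
```

In this made-up case, the serialised schedule leaves two threads idle for the whole communication phase, which is the kind of barrier time the profile would attribute to the OpenMP runtime.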
Change DM.L_range to 20 or more. Testing this with both develop and #266 using matrix_multiply input