Adds a new scalar performance test for a dot product of two Chapel arrays. This is inspired by the conversation in https://github.com/chapel-lang/chapel/issues/24864, and uses several of the kernels shown there.
Note that while this test primarily measures scalar performance, it also uses some of Chapel's parallel features to serve as a baseline. For example, the "Chapeltastic" version is + reduce (A*B).
Here is the current state of performance (in seconds). Note that the number of iterations varies with the data size so that every run has the same total trip count. With purely scalar code, different data sizes should therefore perform similarly. This is why the Chapeltastic version, which runs in parallel, is the only one that gets faster at the larger data size. Also note that I included the slices version in these tables but not in the graphs, since it skewed the data too much.
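As a rough illustration of the constant-trip-count idea, here is a minimal sketch. The names totalTrips, N, and iterations are hypothetical and the values are placeholders; the actual test may size and time its runs differently.

```chapel
use Time;

// Hypothetical knobs, not taken from the actual test.
config const totalTrips = 10_000_000;  // assumed fixed amount of inner-loop work
config const N = 5_000;                // array length (assumed to divide totalTrips)

// Repeat the kernel so that N * iterations stays constant across data sizes.
const iterations = totalTrips / N;

var A, B: [0..<N] real;
A = 1.0;
B = 2.0;

var result = 0.0;
var t: stopwatch;
t.start();
for rep in 1..iterations {
  // Plain scalar dot product, standing in for any of the kernels.
  var sum = 0.0;
  for i in A.domain do
    sum += A[i] * B[i];
  result += sum;
}
t.stop();

writeln("N=", N, " iterations=", iterations,
        " seconds=", t.elapsed(), " result=", result);
```

Running this with --N=5_000 and --N=500_000 does the same number of multiply-adds in both cases, so a purely scalar kernel should take about the same time at either size.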
Key
dotProdFor: a plain for loop over the data
dotProdForeach: a plain foreach loop over the data
dotProdChapeltastic: + reduce (A*B)
dotProdSlices: a foreach loop using array slices to do unrolling
dotProdParamFor: a foreach loop using an inner param for loop to do unrolling
dotProdParamForCArray: same as dotProdParamFor, but using c_array as the sum buffer
dotProdMetadataUnrollFor: same as dotProdFor, using @llvm.metadata to unroll the loop
dotProdMetadataUnrollForeach: same as dotProdForeach, using @llvm.metadata to unroll the loop
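To make the key more concrete, here is a minimal sketch of four of the kernels. The signatures, the 0-based indexing, and the requirement that the array length be a multiple of unrollFactor are assumptions for illustration; the code in the actual test may differ. The sketch also assumes a Chapel version that supports with-clauses on foreach loops.

```chapel
// dotProdFor: a plain serial for loop over the data.
proc dotProdFor(A: [] real, B: [] real): real {
  var sum = 0.0;
  for i in A.domain do
    sum += A[i] * B[i];
  return sum;
}

// dotProdForeach: the same loop written as a foreach, which tells the
// compiler the iterations are order-independent (a vectorization hint).
proc dotProdForeach(A: [] real, B: [] real): real {
  var sum = 0.0;
  foreach i in A.domain with (ref sum) do
    sum += A[i] * B[i];
  return sum;
}

// dotProdChapeltastic: promoted multiply followed by a parallel reduction.
proc dotProdChapeltastic(A: [] real, B: [] real): real {
  return + reduce (A * B);
}

// dotProdParamFor: manual unrolling with an inner param for loop.
// Assumes A and B are indexed from 0 and that their size is a multiple
// of unrollFactor (no remainder loop in this sketch).
proc dotProdParamFor(A: [] real, B: [] real, param unrollFactor = 4): real {
  const len = A.size;
  var sums: unrollFactor * real;  // one partial sum per unrolled lane
  foreach i in 0..<len by unrollFactor with (ref sums) {
    for param j in 0..unrollFactor-1 do
      sums[j] += A[i + j] * B[i + j];
  }
  // Combine the partial sums.
  var total = 0.0;
  for param j in 0..unrollFactor-1 do
    total += sums[j];
  return total;
}

config const n = 16;  // keep this a multiple of the unroll factor
var A, B: [0..<n] real;
A = 2.0;
B = 3.0;

writeln(dotProdFor(A, B), " ", dotProdForeach(A, B), " ",
        dotProdChapeltastic(A, B), " ", dotProdParamFor(A, B));
```

All four procedures compute the same value; they differ only in how much freedom the compiler has to vectorize, unroll, or parallelize the loop.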
Arm M1 using real(64) with an unrollFactor of 4
| kernel | N=5_000 | N=500_000 |
| --- | --- | --- |
| dotProdFor | 0.064663 | 0.058809 |
| dotProdForeach | 0.046577 | 0.047551 |
| dotProdChapeltastic | 0.042991 | 0.009235 |
| dotProdSlices | 31.9542 | 34.4907 |
| dotProdParamFor | 0.01656 | 0.016573 |
| dotProdParamForCArray | 0.016108 | 0.016922 |
| dotProdMetadataUnrollFor | 0.049084 | 0.04836 |
| dotProdMetadataUnrollForeach | 0.047245 | 0.049016 |
AMD EPYC 7543P using real(64) with an unrollFactor of 4