chapel-lang / chapel

Add dot product scalar performance test #24918

Open jabraham17 opened 2 weeks ago

jabraham17 commented 2 weeks ago

Adds a new scalar performance test for a dot product of two Chapel arrays. This is inspired by the conversation in https://github.com/chapel-lang/chapel/issues/24864, and uses several of the kernels shown there.

Note that while this test primarily measures scalar performance, it also uses some of Chapel's parallel features to serve as a baseline. For example, the "Chapeltastic" version is + reduce (A*B).
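To make the contrast concrete, here is a minimal sketch of a plain scalar loop next to the reduction form quoted above. The proc names are borrowed from the results tables, but the bodies, the 0-based domain, and the fill values are assumptions for illustration and are not taken from the code in this PR.

```chapel
// Scalar baseline: a serial loop that accumulates one product per iteration.
// (Assumed body; only the kernel name comes from the results tables.)
proc dotProdFor(A: [] real(64), B: [] real(64)): real(64) {
  var sum: real(64) = 0.0;
  for i in A.domain do
    sum += A[i] * B[i];
  return sum;
}

// "Chapeltastic" form: a promoted multiply fed into a + reduction,
// which Chapel is free to execute in parallel.
proc dotProdChapeltastic(A: [] real(64), B: [] real(64)): real(64) {
  return + reduce (A * B);
}

proc main() {
  const n = 8;                      // illustrative size only
  var A, B: [0..<n] real(64);
  A = 1.5;
  B = 2.0;
  writeln(dotProdFor(A, B));
  writeln(dotProdChapeltastic(A, B));
}
```

Both calls compute 8 * 1.5 * 2.0 = 24.0; the difference is only in how the work is expressed and scheduled.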

Here is the current state of performance (in seconds). Note that the number of iterations varies with the data size so that the total trip count stays the same, so for purely scalar kernels different data sizes should give similar timings. This is why the Chapeltastic version is the only one that gets significantly better at the larger data size. Also note that I included the slices version in these tables but not in the graphs, since it skewed the data too much.
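The trip-count balancing can be sketched as a small timing harness. This is an assumption-laden illustration: the total trip count of 50_000_000, the helper name iterationsFor, and the harness layout are hypothetical, since the description does not state the values or structure the test actually uses.

```chapel
use Time;

// Hypothetical total trip count; the PR does not state the value it uses.
config const totalTrips = 50_000_000;
config const N = 5_000;

// Repetitions needed so that N * iterations stays constant across sizes.
proc iterationsFor(trips: int, n: int): int {
  return trips / n;
}

proc main() {
  const iterations = iterationsFor(totalTrips, N);
  var A, B: [0..<N] real(64);
  A = 1.0;
  B = 2.0;

  var checksum: real(64) = 0.0;
  var timer: stopwatch;
  timer.start();
  for rep in 1..iterations {
    var sum: real(64) = 0.0;
    for i in A.domain do
      sum += A[i] * B[i];
    checksum += sum;   // use the result so the work is not dead code
  }
  timer.stop();

  writeln("N=", N, " iterations=", iterations,
          " seconds=", timer.elapsed(), " checksum=", checksum);
}
```

With these assumed values, N=5_000 runs 10_000 repetitions and N=500_000 (set with --N=500000 at run time) runs 100, so both cases perform the same 50,000,000 multiply-adds.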

Arm M1 using real(64) with an unrollFactor of 4

| kernel | N=5_000 (s) | N=500_000 (s) |
|---|---|---|
| dotProdFor | 0.064663 | 0.058809 |
| dotProdForeach | 0.046577 | 0.047551 |
| dotProdChapeltastic | 0.042991 | 0.009235 |
| dotProdSlices | 31.9542 | 34.4907 |
| dotProdParamFor | 0.01656 | 0.016573 |
| dotProdParamForCArray | 0.016108 | 0.016922 |
| dotProdMetadataUnrollFor | 0.049084 | 0.04836 |
| dotProdMetadataUnrollForeach | 0.047245 | 0.049016 |

AMD EPYC 7543P using real(64) with an unrollFactor of 4

| kernel | N=5_000 (s) | N=500_000 (s) |
|---|---|---|
| dotProdFor | 0.111996 | 0.111998 |
| dotProdForeach | 0.112165 | 0.111715 |
| dotProdChapeltastic | 0.291196 | 0.007845 |
| dotProdSlices | 60.7005 | 60.3888 |
| dotProdParamFor | 0.015189 | 0.014413 |
| dotProdParamForCArray | 0.014297 | 0.014056 |
| dotProdMetadataUnrollFor | 0.042118 | 0.042087 |
| dotProdMetadataUnrollForeach | 0.042037 | 0.041806 |
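
For readers unfamiliar with what an unrollFactor of 4 implies, the following is one plausible way to unroll a dot product by hand with a param loop. It is a sketch under stated assumptions: the name dotProdUnrolled is hypothetical, the arrays are assumed to be indexed from 0, and this is not claimed to match the dotProdParamFor kernel measured above.

```chapel
config param unrollFactor = 4;   // same factor as in the table captions

// Assumed shape of a param-unrolled kernel: the inner param loop is
// expanded at compile time into unrollFactor copies of its body.
proc dotProdUnrolled(A: [] real(64), B: [] real(64)): real(64) {
  const n = A.size;
  const limit = n - n % unrollFactor;   // largest multiple of unrollFactor <= n
  var sum: real(64) = 0.0;
  for i in 0..<limit by unrollFactor {
    for param j in 0..unrollFactor-1 do
      sum += A[i + j] * B[i + j];
  }
  for i in limit..<n do                 // leftover elements
    sum += A[i] * B[i];
  return sum;
}

proc main() {
  var A, B: [0..<10] real(64);
  A = 1.0;
  B = 3.0;
  writeln(dotProdUnrolled(A, B));
}
```

With 10 elements, the unrolled loop covers indices 0 through 7 and the cleanup loop handles the last two, giving 10 * 1.0 * 3.0 = 30.0.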