Add dot product scalar performance test

Adds a new scalar performance test for a dot product of two Chapel arrays. This is inspired by the conversation in https://github.com/chapel-lang/chapel/issues/24864, and uses several of the kernels shown there.

Note that while this test primarily measures scalar performance, it does use some of Chapel's parallel features to serve as a baseline. For example, the "Chapeltastic" version is + reduce (A*B).
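
For readers unfamiliar with the idiom, here is a minimal sketch of how the reduction form relates to a plain serial loop. The array size, the fill values, and the variable names are illustrative assumptions, not taken from the test itself.

```chapel
// Illustrative sketch only; n, the fill values, and the names are assumptions.
config const n = 8;

var A, B: [0..#n] real(64);
A = 2.0;   // promoted scalar assignment fills every element
B = 3.0;

// Promoted multiply followed by a sum reduction; Chapel may run this in parallel.
const parallelDot = + reduce (A * B);

// The same value computed with a plain serial for loop.
var serialDot: real(64) = 0.0;
for i in A.domain do
  serialDot += A[i] * B[i];

writeln(parallelDot, " ", serialDot);
```

Both forms compute the same sum; the reduction simply leaves the iteration strategy to the compiler and runtime.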

Here is the current state of performance (in seconds). Note that the number of iterations varies with the data size so that every run has the same total trip count, which means different data sizes should give similar times when only scalar performance is being measured. This is why the Chapeltastic version is the only one that gets better at a larger data size. Also note that I included the slices version in these tables but not in the graphs, since it skewed the data too much.
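
As a rough illustration of the constant-trip-count idea, a timing harness could scale the repetition count inversely with the array size. This is a sketch under assumptions: `totalTrips`, `iters`, and the loop structure are hypothetical and do not reflect the actual values or harness used in the test.

```chapel
use Time;

config const N = 5_000;               // elements per array
config const totalTrips = 5_000_000;  // hypothetical fixed amount of total work

// Larger arrays get proportionally fewer repetitions.
const iters = totalTrips / N;

var A, B: [0..#N] real(64);
A = 1.0;
B = 2.0;

var result = 0.0;
var t: stopwatch;
t.start();
for rep in 1..iters {
  var sum = 0.0;
  for i in A.domain do
    sum += A[i] * B[i];
  result += sum;   // keep the result live so the work is not optimized away
}
t.stop();

writeln("iters=", iters, " result=", result, " seconds=", t.elapsed());
```

With this scheme, N=5_000 and N=500_000 both execute the same number of multiply-adds, so a purely scalar kernel should take about the same time at either size.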

Key

dotProdFor: a plain for loop over the data
dotProdForeach: a plain foreach loop over the data
dotProdChapeltastic: + reduce (A*B)
dotProdSlices: a foreach loop using array slices to do unrolling
dotProdParamFor: a foreach loop using an inner param for loop to do unrolling
dotProdParamForCArray: same as dotProdParamFor, but using c_array as the sum buffer
dotProdMetadataUnrollFor: same as dotProdFor, using @llvm.metadata to unroll the loop
dotProdMetadataUnrollForeach: same as dotProdForeach, using @llvm.metadata to unroll the loop
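
To make the param-for unrolling idea concrete, here is a hedged sketch of the technique. It is not the code from the test: the procedure name `dotProdUnrolled`, the tuple used as the sum buffer, and the remainder handling are assumptions, and the outer loop is a plain for loop rather than a foreach to keep the example simple.

```chapel
// Sketch of manual unrolling with an inner param for loop.
// Names and structure are illustrative assumptions.
config param unrollFactor = 4;

// Assumes A and B are 0-based arrays of equal size.
proc dotProdUnrolled(A: [] real(64), B: [] real(64)): real(64) {
  const n = A.size;
  const mainEnd = n - n % unrollFactor;  // largest multiple of unrollFactor <= n

  // One partial sum per unrolled lane; a tuple is default-initialized to zeros.
  var sums: unrollFactor * real(64);

  for i in 0..mainEnd-1 by unrollFactor {
    // The param loop is expanded at compile time into unrollFactor statements.
    for param j in 0..unrollFactor-1 do
      sums[j] += A[i + j] * B[i + j];
  }

  // Combine the partial sums, then handle any leftover elements.
  var total: real(64) = 0.0;
  for param j in 0..unrollFactor-1 do
    total += sums[j];
  for i in mainEnd..n-1 do
    total += A[i] * B[i];
  return total;
}

var X, Y: [0..#10] real(64);
X = 1.0;
Y = 2.0;
writeln(dotProdUnrolled(X, Y));
```

Keeping separate partial sums breaks the dependency chain on a single accumulator, which is one plausible reason the param-for variants are fastest in the tables below.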

Arm M1 using real(64) with an unrollFactor of 4

kernel	N=5_000	N=500_000
dotProdFor	0.064663	0.058809
dotProdForeach	0.046577	0.047551
dotProdChapeltastic	0.042991	0.009235
dotProdSlices	31.9542	34.4907
dotProdParamFor	0.01656	0.016573
dotProdParamForCArray	0.016108	0.016922
dotProdMetadataUnrollFor	0.049084	0.04836
dotProdMetadataUnrollForeach	0.047245	0.049016

AMD EPYC 7543P using real(64) with an unrollFactor of 4

kernel	N=5_000	N=500_000
dotProdFor	0.111996	0.111998
dotProdForeach	0.112165	0.111715
dotProdChapeltastic	0.291196	0.007845
dotProdSlices	60.7005	60.3888
dotProdParamFor	0.015189	0.014413
dotProdParamForCArray	0.014297	0.014056
dotProdMetadataUnrollFor	0.042118	0.042087
dotProdMetadataUnrollForeach	0.042037	0.041806

Add dot product scalar performance test #24918