Closed dlewissandy closed 7 years ago
This function has the following mathematical specification:
forall n>0, incx>0, 0<= i <= (n-1)*inc
(sscal n sa sx incx)!i =
if (i `mod` incx) == 0 then sa * u!i else u!i
When incx = 1, this corresponds to the vectorspace operation of scalar multiplication.
BLAS Implementation notes:
Design Choices:
I have created a branch (ISSUE-42) to begin the implementation of this function.
I have collected the time performance data for the native implementation. I am attaching the raw data to this post sscal.csv.zip
I anticipate that for inc=1 that the time performance of the stream based implementation will be proportional to the vector length, or O(inc*(n-1)). For the unsafe implementation, I expect a time complexity proportional to the number of scaled entries, O(n).
Summary statistics for the fit of the native implementation to the model time(n,z) = an + bz + c where z = (n-1)*inc follow:
Residuals:
Min 1Q Median 3Q Max
-5.4225 -0.0249 -0.0064 0.1040 5.9975
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.014e-02 1.460e-01 0.343 0.732
N 8.098e-04 3.811e-05 21.246 <2e-16
Z 5.960e-04 2.410e-05 24.733 <2e-16
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.21 on 87 degrees of freedom
Multiple R-squared: 0.9672, Adjusted R-squared: 0.9664
F-statistic: 1282 on 2 and 87 DF, p-value: < 2.2e-16
Summary statistics for the unsafe implementation to the model time(n,z)=an+bz+c where z=(n-1)*inc follow:
lm(formula = UNSAFE ~ N + Z, data = sscal)
Residuals:
Min 1Q Median 3Q Max
-2.25448 -0.03327 0.04609 0.05948 2.55762
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.014e-02 6.230e-02 0.163 0.871
N 1.175e-04 1.627e-05 7.223 1.82e-10 ***
Z 3.178e-04 1.029e-05 30.893 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5167 on 87 degrees of freedom
Multiple R-squared: 0.957, Adjusted R-squared: 0.956
F-statistic: 968 on 2 and 87 DF, p-value: < 2.2e-16
In both cases, the P value for the coefficients of N and Z are small, indicating that we can reject the null hypothesis that a and b are zero. The fact that the coefficient for Z is non-zero for the unsafe model is somewhat suprising, and may be an indication that the cost of marshaling a mutable copy of the vector has not been separated from the cost of running the algorithm.
From this data, we can expect that the stream based implementation will have an asymptotic relative execution time given by the following expression:
8.1E-4 + 6E-4 *INC
----------------------
1.2E-4 + 3.2E-4 * INC
This corresponds to about 6.7x the unsafe call when INC is equal to 1, and about 1.875x the unsafe call in the case when both N and INC are very large. These numbers are well within the 100x goal for this project milestone.
The foreign calls to sscal currently use the thaw function to create a mutable vector for use in the benchmarking. I have temporarily replaced this call with unsafeThaw to test if it eliminates the unexpected dependence of the time performance on Z. The regression results for the case when INC=1 are as follows:
lm(formula = UNSAFE_NOCOPY ~ 0 + N, data = unrolled)
Residuals:
Min 1Q Median 3Q Max
-0.274000 0.005778 0.035665 0.038943 0.155501
Coefficients:
Estimate Std. Error t value Pr(>|t|)
N 1.348e-04 2.331e-06 57.85 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.09576 on 38 degrees of freedom
Multiple R-squared: 0.9888, Adjusted R-squared: 0.9885
F-statistic: 3347 on 1 and 38 DF, p-value: < 2.2e-16
and When INC>1
all:
lm(formula = UNSAFE_NOCOPY ~ 0 + N, data = rolled)
Residuals:
Min 1Q Median 3Q Max
-0.53339 -0.01117 0.02548 0.03646 0.32093
Coefficients:
Estimate Std. Error t value Pr(>|t|)
N 1.279e-03 3.615e-05 35.38 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1493 on 50 degrees of freedom
Multiple R-squared: 0.9616, Adjusted R-squared: 0.9608
F-statistic: 1252 on 1 and 50 DF, p-value: < 2.2e-16
The unexplained Z dependence has been eliminated :)
Issue completed with the approval of pull request #56 and merged into master.
Documentation for the
sscal
function can be found at BLAS.