bluss / matrixmultiply

General matrix multiplication of f32 and f64 matrices in Rust. Supports matrices with general strides.
https://docs.rs/matrixmultiply/
Apache License 2.0

Only add in masked kernel loop #41

Closed. SuperFluffy closed this 5 years ago.

SuperFluffy commented 5 years ago

The multiplication by alpha should be performed by the actual kernel. This leaves the masked kernel loop to only do addition when constructing the C matrix.
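For context, a minimal sketch of that division of labor, with a scalar stand-in for the kernel; the names (`kernel_scaled`, `masked_add`), the MR x NR scratch-tile shape, and the packing layout are all illustrative and not the crate's actual kernel API:

```rust
// Illustrative micro-tile shape; the real kernels use register-sized tiles.
const MR: usize = 4;
const NR: usize = 4;

// The kernel itself now folds the multiplication by `alpha` into its
// write-out, storing `alpha * (A * B)` in the scratch tile. In the real
// AVX/FMA kernels this scaling runs in the vectorized part of the code.
fn kernel_scaled(k: usize, alpha: f64, a: &[f64], b: &[f64], tmp: &mut [f64; MR * NR]) {
    for i in 0..MR {
        for j in 0..NR {
            let mut acc = 0.0;
            for l in 0..k {
                // Assumed packing: `a` holds MR rows per k-step,
                // `b` holds NR columns per k-step.
                acc += a[l * MR + i] * b[l * NR + j];
            }
            tmp[i * NR + j] = alpha * acc; // scale once, inside the kernel
        }
    }
}

// The masked loop that handles partial edge tiles of C is left doing a
// plain element-wise add; no per-element multiply by `alpha` remains here.
fn masked_add(tmp: &[f64; MR * NR], c: &mut [f64], rows: usize, cols: usize, rsc: usize, csc: usize) {
    for i in 0..rows {
        for j in 0..cols {
            c[i * rsc + j * csc] += tmp[i * NR + j];
        }
    }
}
```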

Looks like there is a reliable gain of about 3-4% for f64 (double precision) on my system (AVX and FMA enabled), at no cost:

 name                 with_scale_add ns/iter  no_scale_add ns/iter  diff ns/iter  diff %  speedup 
 layout_f32_032::ccc  1,765                   1,751                          -14  -0.79%   x 1.01 
 layout_f32_032::ccf  1,754                   1,762                            8   0.46%   x 1.00 
 layout_f32_032::cfc  2,019                   2,026                            7   0.35%   x 1.00 
 layout_f32_032::cff  2,030                   2,032                            2   0.10%   x 1.00 
 layout_f32_032::fcc  1,500                   1,495                           -5  -0.33%   x 1.00 
 layout_f32_032::fcf  1,484                   1,514                           30   2.02%   x 0.98 
 layout_f32_032::ffc  1,761                   1,772                           11   0.62%   x 0.99 
 layout_f32_032::fff  1,745                   1,768                           23   1.32%   x 0.99 
 layout_f64_032::ccc  2,229                   2,247                           18   0.81%   x 0.99 
 layout_f64_032::ccf  2,167                   2,173                            6   0.28%   x 1.00 
 layout_f64_032::cfc  2,381                   2,378                           -3  -0.13%   x 1.00 
 layout_f64_032::cff  2,310                   2,346                           36   1.56%   x 0.98 
 layout_f64_032::fcc  2,139                   2,163                           24   1.12%   x 0.99 
 layout_f64_032::fcf  2,054                   2,070                           16   0.78%   x 0.99 
 layout_f64_032::ffc  2,246                   2,284                           38   1.69%   x 0.98 
 layout_f64_032::fff  2,193                   2,214                           21   0.96%   x 0.99 
 mat_mul_f32::m004    145                     146                              1   0.69%   x 0.99 
 mat_mul_f32::m006    169                     172                              3   1.78%   x 0.98 
 mat_mul_f32::m008    145                     147                              2   1.38%   x 0.99 
 mat_mul_f32::m012    429                     424                             -5  -1.17%   x 1.01 
 mat_mul_f32::m016    407                     390                            -17  -4.18%   x 1.04 
 mat_mul_f32::m032    1,777                   1,753                          -24  -1.35%   x 1.01 
 mat_mul_f32::m064    11,299                  11,505                         206   1.82%   x 0.98 
 mat_mul_f32::m127    81,726                  81,467                        -259  -0.32%   x 1.00 
 mat_mul_f64::m004    148                     143                             -5  -3.38%   x 1.03 
 mat_mul_f64::m006    203                     203                              0   0.00%   x 1.00 
 mat_mul_f64::m008    204                     206                              2   0.98%   x 0.99 
 mat_mul_f64::m012    384                     393                              9   2.34%   x 0.98 
 mat_mul_f64::m016    466                     469                              3   0.64%   x 0.99 
 mat_mul_f64::m032    2,259                   2,224                          -35  -1.55%   x 1.02 
 mat_mul_f64::m064    14,159                  14,231                          72   0.51%   x 0.99 
 mat_mul_f64::m127    100,353                 96,578                      -3,775  -3.76%   x 1.04 

This is for MMTEST_FEATURE=fallback:

 name                 fallback_with_scaled_add ns/iter  fallback_no_scaled_add ns/iter  diff ns/iter  diff %  speedup 
 layout_f32_032::ccc  3,416                             3,373                                    -43  -1.26%   x 1.01 
 layout_f32_032::ccf  3,417                             3,373                                    -44  -1.29%   x 1.01 
 layout_f32_032::cfc  3,621                             3,652                                     31   0.86%   x 0.99 
 layout_f32_032::cff  3,632                             3,607                                    -25  -0.69%   x 1.01 
 layout_f32_032::fcc  3,119                             3,078                                    -41  -1.31%   x 1.01 
 layout_f32_032::fcf  3,110                             3,073                                    -37  -1.19%   x 1.01 
 layout_f32_032::ffc  3,364                             3,361                                     -3  -0.09%   x 1.00 
 layout_f32_032::fff  3,451                             3,410                                    -41  -1.19%   x 1.01 
 layout_f64_032::ccc  6,152                             6,064                                    -88  -1.43%   x 1.01 
 layout_f64_032::ccf  6,139                             5,960                                   -179  -2.92%   x 1.03 
 layout_f64_032::cfc  6,299                             6,157                                   -142  -2.25%   x 1.02 
 layout_f64_032::cff  6,212                             6,116                                    -96  -1.55%   x 1.02 
 layout_f64_032::fcc  5,977                             5,913                                    -64  -1.07%   x 1.01 
 layout_f64_032::fcf  5,907                             5,856                                    -51  -0.86%   x 1.01 
 layout_f64_032::ffc  6,142                             6,007                                   -135  -2.20%   x 1.02 
 layout_f64_032::fff  6,050                             5,997                                    -53  -0.88%   x 1.01 
 mat_mul_f32::m004    70                                67                                        -3  -4.29%   x 1.04 
 mat_mul_f32::m006    117                               117                                        0   0.00%   x 1.00 
 mat_mul_f32::m008    148                               137                                      -11  -7.43%   x 1.08 
 mat_mul_f32::m012    401                               376                                      -25  -6.23%   x 1.07 
 mat_mul_f32::m016    609                               587                                      -22  -3.61%   x 1.04 
 mat_mul_f32::m032    3,373                             3,367                                     -6  -0.18%   x 1.00 
 mat_mul_f32::m064    22,463                            22,547                                    84   0.37%   x 1.00 
 mat_mul_f32::m127    160,537                           159,408                               -1,129  -0.70%   x 1.01 
 mat_mul_f64::m004    69                                65                                        -4  -5.80%   x 1.06 
 mat_mul_f64::m006    163                               160                                       -3  -1.84%   x 1.02 
 mat_mul_f64::m008    195                               190                                       -5  -2.56%   x 1.03 
 mat_mul_f64::m012    489                               472                                      -17  -3.48%   x 1.04 
 mat_mul_f64::m016    961                               939                                      -22  -2.29%   x 1.02 
 mat_mul_f64::m032    6,136                             6,001                                   -135  -2.20%   x 1.02 
 mat_mul_f64::m064    42,357                            41,979                                  -378  -0.89%   x 1.01 
 mat_mul_f64::m127    311,067                           309,980                               -1,087  -0.35%   x 1.00
bluss commented 5 years ago

This change makes sense, since it's a free win for the native kernels. The fallbacks need to be updated (as their debug assertions so neatly point out), so they end up doing some more work, but that's probably good.

For the native kernels, I'd pick a benchmark that actually uses the masked kernel, like the 127 size or something similar, and benchmark that. For the fallback kernel we can just benchmark generally, using any one of them as representative; why not the layout benchmarks.
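If it helps, a sketch of such a benchmark, assuming the crate's public dgemm entry point and the unstable test::Bencher harness; 127 is deliberately not a multiple of the kernel tile size, so the masked edge path gets exercised:

```rust
#![feature(test)] // nightly-only bench harness
extern crate test;
extern crate matrixmultiply;

use test::Bencher;

#[bench]
fn mat_mul_f64_m127(bench: &mut Bencher) {
    let n = 127; // odd size forces partial edge tiles, i.e. the masked kernel path
    let a = vec![1.0_f64; n * n];
    let b = vec![1.0_f64; n * n];
    let mut c = vec![0.0_f64; n * n];
    bench.iter(|| unsafe {
        // Row-major layout: row stride n, column stride 1.
        matrixmultiply::dgemm(
            n, n, n,
            1.0, a.as_ptr(), n as isize, 1,
            b.as_ptr(), n as isize, 1,
            0.0, c.as_mut_ptr(), n as isize, 1,
        );
    });
}
```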

bluss commented 5 years ago

Thanks!