Closed bluss closed 8 years ago
Use a 4-by-8 microkernel for sgemm
Processing more data per kernel invocation is a win. Small sizes that need more zero padding with the wider kernel regress: m004 by about 12% and m012 by about 9%.
| name | sgemm4x4.log ns/iter | sgemm4x8.log ns/iter | diff ns/iter | diff % |
|---|---:|---:|---:|---:|
| mat_mul_f32::m004 | 153 | 172 | 19 | 12.42% |
| mat_mul_f32::m007 | 280 | 271 | -9 | -3.21% |
| mat_mul_f32::m008 | 305 | 286 | -19 | -6.23% |
| mat_mul_f32::m012 | 598 | 650 | 52 | 8.70% |
| mat_mul_f32::m016 | 1,044 | 945 | -99 | -9.48% |
| mat_mul_f32::m032 | 5,037 | 4,638 | -399 | -7.92% |
| mat_mul_f32::m064 | 30,305 | 27,748 | -2,557 | -8.44% |
| mat_mul_f32::m127 | 208,380 | 188,977 | -19,403 | -9.31% |
| mat_mul_f32::mix128x10000x128 | 16,291,680 | 14,293,288 | -1,998,392 | -12.27% |
| nonative_mat_mul_f32::m127 | 291,936 | 272,487 | -19,449 | -6.66% |
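
The regressions at m004 and m012 are consistent with the zero-padding explanation. A rough sketch of the arithmetic follows, assuming the kernel always computes whole tiles and that the wider kernel dimension is 8 where the old one was 4. The helpers `padded` and `wasted_fraction` are hypothetical names for illustration and are not taken from the crate.

```rust
/// Round `n` up to the next multiple of `tile`.
/// Hypothetical helper for illustration only.
fn padded(n: usize, tile: usize) -> usize {
    (n + tile - 1) / tile * tile
}

/// Fraction of the padded extent that is zero padding rather than real data.
fn wasted_fraction(n: usize, tile: usize) -> f64 {
    let p = padded(n, tile);
    (p - n) as f64 / p as f64
}

fn main() {
    // Matrix sizes taken from the benchmark names above.
    for &n in &[4usize, 7, 8, 12, 16] {
        println!(
            "n={:>2}: width 4 pads to {:>2} ({:>3.0}% padding), width 8 pads to {:>2} ({:>3.0}% padding)",
            n,
            padded(n, 4),
            100.0 * wasted_fraction(n, 4),
            padded(n, 8),
            100.0 * wasted_fraction(n, 8),
        );
    }
}
```

Under these assumptions, sizes 4 and 12 pad to 8 and 16 with the wider kernel but need no padding with the 4-wide one, while 7, 8, and 16 pad to the same extent either way. That matches which benchmarks regress and which improve.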