@petevine Is it possible for you to bench this on ARM (NEON) again? I'm not certain there will be larger wins there, but one could hope...
Sure, give me a minute.
running 41 tests
test mat_mul_f32::m004 ... bench: 1,635 ns/iter (+/- 27)
test mat_mul_f32::m005 ... bench: 2,220 ns/iter (+/- 32)
test mat_mul_f32::m006 ... bench: 2,515 ns/iter (+/- 89)
test mat_mul_f32::m007 ... bench: 2,752 ns/iter (+/- 315)
test mat_mul_f32::m008 ... bench: 2,864 ns/iter (+/- 44)
test mat_mul_f32::m009 ... bench: 5,879 ns/iter (+/- 383)
test mat_mul_f32::m012 ... bench: 7,680 ns/iter (+/- 71)
test mat_mul_f32::m016 ... bench: 11,891 ns/iter (+/- 545)
test mat_mul_f32::m032 ... bench: 68,891 ns/iter (+/- 1,118)
test mat_mul_f32::m064 ... bench: 511,802 ns/iter (+/- 11,075)
test mat_mul_f32::m127 ... bench: 4,011,067 ns/iter (+/- 80,360)
test mat_mul_f32::m256 ... bench: 32,911,345 ns/iter (+/- 244,070)
test mat_mul_f32::m512 ... bench: 262,133,757 ns/iter (+/- 410,091)
test mat_mul_f32::mix128x10000x128 ... bench: 336,832,987 ns/iter (+/- 624,222)
test mat_mul_f32::mix16x4 ... bench: 21,936 ns/iter (+/- 327)
test mat_mul_f32::mix32x2 ... bench: 18,080 ns/iter (+/- 322)
test mat_mul_f32::mix97 ... bench: 2,334,810 ns/iter (+/- 48,680)
test mat_mul_f64::m004 ... bench: 2,232 ns/iter (+/- 40)
test mat_mul_f64::m007 ... bench: 4,935 ns/iter (+/- 54)
test mat_mul_f64::m008 ... bench: 5,202 ns/iter (+/- 138)
test mat_mul_f64::m012 ... bench: 16,797 ns/iter (+/- 688)
test mat_mul_f64::m016 ... bench: 27,916 ns/iter (+/- 279)
test mat_mul_f64::m032 ... bench: 199,725 ns/iter (+/- 4,456)
test mat_mul_f64::m064 ... bench: 1,591,507 ns/iter (+/- 41,950)
test mat_mul_f64::m127 ... bench: 13,077,357 ns/iter (+/- 97,630)
test mat_mul_f64::m256 ... bench: 108,618,579 ns/iter (+/- 574,423)
test mat_mul_f64::m512 ... bench: 862,021,707 ns/iter (+/- 1,782,128)
test mat_mul_f64::mix128x10000x128 ... bench: 1,072,794,838 ns/iter (+/- 1,792,908)
test mat_mul_f64::mix16x4 ... bench: 41,538 ns/iter (+/- 2,493)
test mat_mul_f64::mix32x2 ... bench: 33,025 ns/iter (+/- 554)
test mat_mul_f64::mix97 ... bench: 7,830,734 ns/iter (+/- 94,210)
test ref_mat_mul_f32::m004 ... bench: 618 ns/iter (+/- 8)
test ref_mat_mul_f32::m005 ... bench: 1,070 ns/iter (+/- 19)
test ref_mat_mul_f32::m006 ... bench: 1,722 ns/iter (+/- 9)
test ref_mat_mul_f32::m007 ... bench: 2,605 ns/iter (+/- 18)
test ref_mat_mul_f32::m008 ... bench: 3,754 ns/iter (+/- 18)
test ref_mat_mul_f32::m009 ... bench: 5,548 ns/iter (+/- 56)
test ref_mat_mul_f32::m012 ... bench: 11,939 ns/iter (+/- 55)
test ref_mat_mul_f32::m016 ... bench: 27,278 ns/iter (+/- 255)
test ref_mat_mul_f32::m032 ... bench: 207,125 ns/iter (+/- 1,085)
test ref_mat_mul_f32::m064 ... bench: 1,655,207 ns/iter (+/- 23,070)
Thank you!
Let's put that through cargo-benchcmp and compare with the previous https://github.com/bluss/matrixmultiply/issues/6#issue-145536556 numbers; it looks like a similar speedup (maybe a bit more) for the small sizes. Diff for 64 might be a fluke.
name               neontwoalloc.log ns/iter  neononealloc.log ns/iter  diff ns/iter   diff %
mat_mul_f32::m004                     2,026                     1,635          -391  -19.30%
mat_mul_f32::m005                     2,852                     2,220          -632  -22.16%
mat_mul_f32::m006                     2,955                     2,515          -440  -14.89%
mat_mul_f32::m007                     3,191                     2,752          -439  -13.76%
mat_mul_f32::m008                     3,404                     2,864          -540  -15.86%
mat_mul_f32::m009                     6,722                     5,879          -843  -12.54%
mat_mul_f32::m012                     8,575                     7,680          -895  -10.44%
mat_mul_f32::m016                    12,836                    11,891          -945   -7.36%
mat_mul_f32::m032                    72,610                    68,891        -3,719   -5.12%
Either a fluke, or allocation overhead behaves differently on ARM.
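(For anyone reproducing the comparison: the table above is cargo-benchcmp output. Assuming each log file is a saved cargo bench run, one from before the change and one after, it was generated with something like:

cargo bench > neononealloc.log
cargo benchcmp neontwoalloc.log neononealloc.log
)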
I guess you'd get more consistent results on aarch64. Besides, the Cortex-A5 is a low-power/efficiency design above all else, so the results are not entirely representative of the benefits one would expect from NEON.
I don't know NEON at all, but I'd first look at the larger matrix cases to evaluate it. For small enough matrices, there's a lot of overhead.
I suspected the overhead was related to 32-bit targets, and that looks plausible; on i686:
test mat_mul_f32::m004 ... bench: 310 ns/iter (+/- 4)
test mat_mul_f32::m007 ... bench: 622 ns/iter (+/- 11)
test mat_mul_f32::m008 ... bench: 575 ns/iter (+/- 4)
test mat_mul_f32::m012 ... bench: 1,510 ns/iter (+/- 5)
test mat_mul_f32::m016 ... bench: 2,124 ns/iter (+/- 20)
test mat_mul_f64::m004 ... bench: 397 ns/iter (+/- 0)
test mat_mul_f64::m007 ... bench: 764 ns/iter (+/- 3)
test mat_mul_f64::m008 ... bench: 838 ns/iter (+/- 5)
test mat_mul_f64::m012 ... bench: 2,404 ns/iter (+/- 4)
test mat_mul_f64::m016 ... bench: 3,993 ns/iter (+/- 44)
@petevine Sorry, can you explain what I should compare that to? It looks good to me though.
I thought you were pointing to the much smaller difference for 64 on your platform. Never mind, it seems I misread your meaning, since I had omitted any 64 results before.
Oh, apparently I did say “Diff for 64 might be a fluke.”, but I meant that the diff for mat_mul_f32::m032 might be a fluke. Off by a power of two... sorry :smile:
I've got another data point, from a 2 GHz 64-bit ARM Cortex-A53 processor (compared with the 1.7 GHz Cortex-A5, it should be about 50% faster, not counting 64-bit benefits):
running 41 tests
test mat_mul_f32::m004 ... bench: 1,098 ns/iter (+/- 12)
test mat_mul_f32::m005 ... bench: 1,587 ns/iter (+/- 22)
test mat_mul_f32::m006 ... bench: 1,718 ns/iter (+/- 21)
test mat_mul_f32::m007 ... bench: 1,923 ns/iter (+/- 9)
test mat_mul_f32::m008 ... bench: 2,138 ns/iter (+/- 9)
test mat_mul_f32::m009 ... bench: 4,260 ns/iter (+/- 33)
test mat_mul_f32::m012 ... bench: 5,484 ns/iter (+/- 40)
test mat_mul_f32::m016 ... bench: 8,621 ns/iter (+/- 74)
test mat_mul_f32::m032 ... bench: 48,623 ns/iter (+/- 183)
test mat_mul_f32::m064 ... bench: 328,603 ns/iter (+/- 1,405)
test mat_mul_f32::m127 ... bench: 2,387,223 ns/iter (+/- 34,915)
test mat_mul_f32::m256 ... bench: 20,490,803 ns/iter (+/- 180,891)
test mat_mul_f32::m512 ... bench: 164,162,031 ns/iter (+/- 545,316)
test mat_mul_f32::mix128x10000x128 ... bench: 216,476,447 ns/iter (+/- 358,523)
test mat_mul_f32::mix16x4 ... bench: 17,923 ns/iter (+/- 49)
test mat_mul_f32::mix32x2 ... bench: 15,748 ns/iter (+/- 62)
test mat_mul_f32::mix97 ... bench: 1,449,014 ns/iter (+/- 7,570)
test mat_mul_f64::m004 ... bench: 1,202 ns/iter (+/- 21)
test mat_mul_f64::m007 ... bench: 2,102 ns/iter (+/- 10)
test mat_mul_f64::m008 ... bench: 2,547 ns/iter (+/- 18)
test mat_mul_f64::m012 ... bench: 7,781 ns/iter (+/- 36)
test mat_mul_f64::m016 ... bench: 12,981 ns/iter (+/- 56)
test mat_mul_f64::m032 ... bench: 88,629 ns/iter (+/- 365)
test mat_mul_f64::m064 ... bench: 665,406 ns/iter (+/- 4,620)
test mat_mul_f64::m127 ... bench: 5,531,354 ns/iter (+/- 157,512)
test mat_mul_f64::m256 ... bench: 45,800,153 ns/iter (+/- 82,341)
test mat_mul_f64::m512 ... bench: 365,977,216 ns/iter (+/- 402,251)
test mat_mul_f64::mix128x10000x128 ... bench: 493,044,547 ns/iter (+/- 463,301)
test mat_mul_f64::mix16x4 ... bench: 20,585 ns/iter (+/- 62)
test mat_mul_f64::mix32x2 ... bench: 13,516 ns/iter (+/- 47)
test mat_mul_f64::mix97 ... bench: 3,202,931 ns/iter (+/- 74,201)
test ref_mat_mul_f32::m004 ... bench: 530 ns/iter (+/- 4)
test ref_mat_mul_f32::m005 ... bench: 941 ns/iter (+/- 2)
test ref_mat_mul_f32::m006 ... bench: 1,537 ns/iter (+/- 4)
test ref_mat_mul_f32::m007 ... bench: 2,689 ns/iter (+/- 30)
test ref_mat_mul_f32::m008 ... bench: 3,783 ns/iter (+/- 4)
test ref_mat_mul_f32::m009 ... bench: 5,307 ns/iter (+/- 66)
test ref_mat_mul_f32::m012 ... bench: 11,755 ns/iter (+/- 28)
test ref_mat_mul_f32::m016 ... bench: 26,824 ns/iter (+/- 42)
test ref_mat_mul_f32::m032 ... bench: 202,902 ns/iter (+/- 350)
test ref_mat_mul_f32::m064 ... bench: 1,584,415 ns/iter (+/- 3,420)
Use one Vec for both packing buffers
This removes a small overhead that is especially visible for small problem sizes.
Note: This library's primary focus is on large matrices, but we will improve performance on small matrix multiplication problems whenever we can do so without penalizing the larger cases.
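As a minimal sketch of the idea (hypothetical names, not the crate's actual code; a real implementation would also care about the alignment of the packed panels, which this omits): one Vec backs both the packed-A and packed-B buffers, and split_at_mut divides it into two disjoint slices.

// Hypothetical sketch: one allocation instead of two, shown for f32 only.
fn with_packing_buffers<F>(apack_len: usize, bpack_len: usize, f: F)
where
    F: FnOnce(&mut [f32], &mut [f32]),
{
    // Single allocation covering both panels.
    let mut buffer = vec![0.0_f32; apack_len + bpack_len];
    // Split into two non-overlapping mutable slices.
    let (apack, bpack) = buffer.split_at_mut(apack_len);
    f(apack, bpack);
}

The saving is one allocation/deallocation per call instead of two, which is why the effect is most visible at the small benchmark sizes above.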