Idein / py-videocore6

Python library for GPGPU programming on Raspberry Pi 4
https://idein.jp
GNU General Public License v2.0

sgemm.py: Remove an extra load when the number of params is a multiple of 16 #33

Closed Terminus-IMRC closed 4 years ago

Terminus-IMRC commented 4 years ago

This is a minor fix: the condition for the case where the number of params is a multiple of 16 was wrong, so an extra load was issued. This PR fixes that.
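As a rough illustration of this kind of off-by-one (not the actual code in sgemm.py), the sketch below counts how many loads are issued for a given number of params. It assumes each load delivers one batch of 16 params; the names `LANES`, `loads_before_fix`, and `loads_after_fix` are hypothetical, and the pre-fix formula is only a guess that reproduces the behaviour shown in the listings.

```python
LANES = 16  # assumed batch size: one load supplies 16 params


def loads_before_fix(n_params: int) -> int:
    """Assumed pre-fix behaviour (hypothetical model, n_params >= 1).

    One load is issued up front, and another after every completed
    batch of 16, whether or not any params remain to be read.
    """
    return 1 + n_params // LANES


def loads_after_fix(n_params: int) -> int:
    """Loads actually needed: ceil(n_params / LANES)."""
    return (n_params + LANES - 1) // LANES


if __name__ == "__main__":
    # Print the load counts for a few sizes around the batch boundary.
    for n in (1, 15, 16, 17, 32):
        print(n, loads_before_fix(n), loads_after_fix(n))
```

Under this model the two counts differ only when the number of params is a multiple of 16. For n = 16 it gives two loads before the fix and one after, which matches the number of `tmua` writes visible in the listings.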

Assembler output for the n = 16 case before the fix:

or  tmua, r0, r0     ; add  r0, r0, r3   ; thrsw
nop                  ; nop
nop                  ; nop
nop                  ; nop               ; ldtmu.r1
or  r5rep, r1, r1    ; nop
or  rf32, r5, r5     ; nop
nop                  ; mov  r5rep, r1
or  rf33, r5, r5     ; nop

...

nop                  ; mov  r5rep, r1
or  rf46, r5, r5     ; nop
or  tmua, r0, r0     ; add  r0, r0, r3   ; thrsw
nop                  ; mov  r5rep, r1
or  rf47, r5, r5     ; nop
nop                  ; nop               ; ldtmu.r1
add  r0, rf32, 15    ; nop

The same case after the fix:

or  tmua, r0, r0     ; add  r0, r0, r3   ; thrsw
nop                  ; nop
nop                  ; nop
nop                  ; nop               ; ldtmu.r1
or  r5rep, r1, r1    ; nop
or  rf32, r5, r5     ; nop
nop                  ; mov  r5rep, r1
or  rf33, r5, r5     ; nop

...

nop                  ; mov  r5rep, r1
or  rf46, r5, r5     ; nop
nop                  ; mov  r5rep, r1
or  rf47, r5, r5     ; nop
add  r0, rf32, 15    ; nop

Terminus-IMRC commented 4 years ago

Thanks!