Faster examples with an extended TMU usage

Idein / py-videocore6

Python library for GPGPU programming on Raspberry Pi 4

https://idein.jp

GNU General Public License v2.0

Faster examples with an extended TMU usage #53

Closed Terminus-IMRC closed 3 years ago

Terminus-IMRC commented 3 years ago

It turns out that the TMU can read or write up to four vectors at a time, and that it can prefetch into the TMU cache without an explicit wait (see nir_to_vir.c in Mesa for the usage). Using both in the examples increased the memory performance by a few percent:

summation: 7010 MB/s -> 7242 MB/s (+3.3%)
memset: 3739 MB/s -> 3752 MB/s (+0.35%)
scopy: 2386 MB/s -> 2424 MB/s (+1.6%)
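
For reference, the percentages above can be re-derived from the reported throughputs. This is only a small arithmetic sketch in plain Python, not part of the pull request; the names `results` and `relative_gain` are made up for illustration.

```python
# Throughputs in MB/s as reported in this comment: (before, after).
results = {
    "summation": (7010, 7242),
    "memset": (3739, 3752),
    "scopy": (2386, 2424),
}


def relative_gain(before, after):
    """Return the relative throughput gain in percent."""
    return (after - before) / before * 100.0


# Hypothetical helper output: one gain per benchmark.
gains = {name: relative_gain(b, a) for name, (b, a) in results.items()}

for name, gain in gains.items():
    before, after = results[name]
    print(f"{name}: {before} MB/s -> {after} MB/s ({gain:+.2f}%)")
```

The gains are relative to the old throughput, so memset, which was already the closest to its previous figure, shows the smallest change.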

Terminus-IMRC commented 3 years ago

Yes, I'll give it a try and let you know when it's done.

Terminus-IMRC commented 3 years ago

@notogawa I added some tests, so could you please review this pull request again?

Terminus-IMRC commented 3 years ago

Thank you for reviewing!