Terminus-IMRC closed this 3 years ago
It turns out that the TMU can read or write up to four vectors at a time, and that it can prefetch into the TMU cache without an explicit wait (see nir_to_vir.c in Mesa for the usage). Using both improves memory performance by up to a few percent:
summation: 7010 MB/s -> 7242 MB/s (+3.3%)
memset: 3739 MB/s -> 3752 MB/s (+0.35%)
scopy: 2386 MB/s -> 2424 MB/s (+1.6%)
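
For reference, the percentages can be re-derived from the throughput figures quoted above. This is only a small arithmetic sketch: the `results` dictionary and the `speedup_percent` helper are names made up for illustration and are not part of this pull request or of the benchmark code.

```python
# Throughput in MB/s as (before, after), copied from the figures quoted above.
results = {
    "summation": (7010, 7242),
    "memset": (3739, 3752),
    "scopy": (2386, 2424),
}


def speedup_percent(before, after):
    """Relative change in throughput, in percent (hypothetical helper)."""
    return (after - before) / before * 100.0


for name, (before, after) in results.items():
    change = speedup_percent(before, after)
    print(f"{name}: {before} MB/s -> {after} MB/s ({change:+.2f}%)")
```

The gain is largest for summation, a read-heavy kernel, and smallest for memset, which only writes.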
Yes, I'll give it a try and let you know when it's done.
@notogawa I've added some tests. Could you please review this pull request again?
Thank you for reviewing!