Hydr8gon / sm64

A port of Super Mario 64 for the DSi
Creative Commons Zero v1.0 Universal

Performance Optimization: Using TCM and Wram #33

Open AngelTomkins opened 4 months ago

AngelTomkins commented 4 months ago

The current code does not use any of the TCM (tightly coupled memory). This means the instruction cache and data cache are overwritten more often, which leads to more reads from main RAM. Putting the most commonly run functions into the ITCM, and putting data that is used only by the ARM9 CPU into the DTCM, should be a fairly cheap performance improvement.
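As a rough illustration of what this placement could look like, here is a minimal sketch using the `ITCM_CODE` and `DTCM_DATA` section macros that libnds provides. The function and table names (`hot_dot3`, `hot_table`, `hot_lookup`) are hypothetical and not taken from the port, and the empty fallback macros exist only so the snippet also compiles on a host toolchain.

```c
#include <stdint.h>

/* On a DS build these macros come from libnds. The empty fallbacks below
   only let this sketch compile on a host toolchain. */
#if defined(ARM9)
#include <nds.h>
#else
#define ITCM_CODE
#define DTCM_DATA
#endif

/* Hypothetical table that only the ARM9 touches, placed in DTCM. */
DTCM_DATA static int16_t hot_table[4] = { 0, 1024, 2048, 3072 };

/* Hypothetical hot function, placed in ITCM so that fetching it never
   goes through the instruction cache or main RAM. */
ITCM_CODE int32_t hot_dot3(const int16_t *a, const int16_t *b)
{
    return (int32_t)a[0] * b[0]
         + (int32_t)a[1] * b[1]
         + (int32_t)a[2] * b[2];
}

/* Regular function (main RAM) that reads the DTCM-resident table. */
int32_t hot_lookup(unsigned i)
{
    return hot_table[i & 3];
}
```

The ITCM is 32KiB and the DTCM is 16KiB, and the DTCM usually also holds the ARM9 stack, so only a small set of functions and data can be moved this way.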

I have experimented with this, and filling the ITCM with commonly run functions is not a night-and-day improvement. This is likely down to caching: a function that runs again before its lines are evicted from the instruction cache gains little from being in the ITCM. So there should be a method of finding the best functions based on how commonly they are run, and on whether the function is usually run again before the instruction cache is overwritten.
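One possible starting point for such a method, sketched under assumptions: take per-function call counts from a profiling run and code sizes from the linker map, then greedily fill the ITCM with the functions that have the most calls per byte. The `FuncProfile` struct, `pick_itcm_functions`, and the full 32KiB budget are all hypothetical. This heuristic only covers the "how commonly they are run" part; it does not model whether a function would still have been in the instruction cache.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* The ARM9 ITCM is 32 KiB. Assumption: the whole region is available,
   which ignores anything the toolchain already places there. */
#define ITCM_BUDGET_BYTES (32u * 1024u)

typedef struct {
    const char *name;  /* hypothetical function name */
    uint32_t size;     /* code size in bytes, e.g. from the linker map */
    uint64_t calls;    /* call count from a profiling run */
    int in_itcm;       /* output: 1 if selected for ITCM */
} FuncProfile;

/* Order by calls per byte, descending. Cross-multiplying avoids
   floating point: a.calls/a.size > b.calls/b.size
   is equivalent to a.calls*b.size > b.calls*a.size. */
static int by_density_desc(const void *pa, const void *pb)
{
    const FuncProfile *a = pa;
    const FuncProfile *b = pb;
    uint64_t lhs = a->calls * b->size;
    uint64_t rhs = b->calls * a->size;
    return (lhs < rhs) - (lhs > rhs);
}

/* Sorts funcs in place, marks the selected ones, and returns the
   number of ITCM bytes used. */
uint32_t pick_itcm_functions(FuncProfile *funcs, size_t count)
{
    uint32_t used = 0;

    qsort(funcs, count, sizeof *funcs, by_density_desc);
    for (size_t i = 0; i < count; i++) {
        funcs[i].in_itcm = 0;
        if (funcs[i].size > 0 && funcs[i].size <= ITCM_BUDGET_BYTES - used) {
            funcs[i].in_itcm = 1;
            used += funcs[i].size;
        }
    }
    return used;
}
```

A better metric would probably weight each function by its measured instruction cache misses rather than raw call counts, since a small function called in a tight loop may never leave the cache.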

The DSi has 800KiB of WRAM. Putting data and functions there would mean faster read times and fewer stalls caused by the ARM7 reading audio data on the main bus. The comparison of read speeds from the WRAM and main RAM can be seen here. The current ARM9 code is about 800KiB in ARM mode, and around 600KiB if we use Thumb mode, which could fit into the WRAM. There is a basic implementation of WRAM in BlocksDS.