dciabrin / ngdevkit

Open source development for Neo-Geo
GNU Lesser General Public License v3.0

Optimize C runtime initialization for speed over size #85

Closed KScl closed 1 year ago

KScl commented 1 year ago

Instead of clearing or overwriting memory one byte at a time, take advantage of the fact that #__bss_start_in_ram and #__data_start_in_ram are already guaranteed to be long-aligned, reserve d1-d7 and a0, and movem from them directly into the correct locations. This speeds up the initialization sequence for larger programs by a couple orders of magnitude.

Because this code works in blocks of 32 bytes only, some chaff is left over at the end of user memory (from 1 to 32 bytes worth). I don't consider this an issue, because:

The BSS segment is also initialized before the data segment, to avoid a bug that the original code could occasionally encounter: Because it was off by one in its calculations, clearing out the BSS segment could result in clobbering the first byte of the data segment, if there was no padding between the end of the BSS segment and the start of the data segment (aka: if the length of the BSS segment was a multiple of 4).

In addition, the watchdog is now kicked only once, at the end of each loop. The loops are fast enough that a watchdog reset is no longer a concern even in the worst case, and dropping the per-iteration kick yields a further speedup.
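For readers who don't follow 68000 assembly, the block-wise clear described above can be modeled in C roughly as follows. This is only an illustrative sketch, not the code from the patch: `clear_in_blocks` and `watchdog_kicks` are hypothetical names, and the assumption that the loop runs `len / 32 + 1` times is inferred from the "1 to 32 bytes" of overshoot mentioned in this comment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter standing in for a write to the watchdog register. */
static unsigned watchdog_kicks = 0;

/*
 * Clear len_bytes starting at dst, 32 bytes (eight longwords) per
 * iteration, the way one movem.l of eight zeroed registers would.
 *
 * ASSUMPTION: the loop runs len_bytes / 32 + 1 times, so it overshoots
 * the requested length by 1 to 32 bytes. dst must be long-aligned and
 * the caller must own the overshoot area.
 *
 * Returns the number of bytes actually written.
 */
static size_t clear_in_blocks(uint32_t *dst, size_t len_bytes)
{
    size_t iterations = len_bytes / 32 + 1;

    for (size_t i = 0; i < iterations; i++) {
        /* Eight longword stores model a single 32-byte movem.l. */
        for (int r = 0; r < 8; r++) {
            *dst++ = 0;
        }
    }

    /* The watchdog is kicked once, after the whole loop has finished. */
    watchdog_kicks++;

    return iterations * 32;
}
```

Under this model, clearing a 40-byte region writes 64 bytes, so the 24 bytes past the end of the region are the "chaff" discussed above.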

dciabrin commented 1 year ago

I like the idea of improving the speed of the C runtime initialization, although I think we should put additional effort into cleaning up the boundaries which aren't a multiple of 32, before and after the optimization.

Also, are we sure that the loop is fast enough to not trigger the watchdog on real hardware? I haven't looked at https://wiki.neogeodev.org/index.php?title=Watchdog yet, but I think we probably need to double-check that the maximum time it takes to init the C runtime is within the number of allowed cycles.

In case it's not, how much of a slowdown would writing to the watchdog for every loop iteration incur?

KScl commented 1 year ago

So, my best estimation for the absolute worst case scenario for both loops (writing from the start to end of user memory, from 100000-10f300 — realistically this would overwrite the stack, but you get the idea here):

... for a total of 175,146 cycles for .bss initialization, and:

... for a total of 322,942 cycles for .data initialization.

Kicking the watchdog inside of either loop adds an additional 16(n + 1) cycles, or 31,120 cycles in the above worst case.

From both the wiki and MAME's source, the watchdog sets /RESET low after roughly 1,622,015 cycles (3,244,030 cycles on MCLK). Neither loop gets close to that point, and the watchdog is kicked immediately after each loop completes.
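The arithmetic above can be double-checked with a few lines of C. This is a sketch that only re-derives the figures quoted in this comment. The macro and function names are hypothetical, and reading `n` as the number of 32-byte blocks in the range is an assumption (it is the value that reproduces the 31,120 figure).

```c
#include <stdint.h>

/* Figures quoted in this thread, in 68000 cycles. */
#define USER_RAM_START        0x100000u
#define USER_RAM_END          0x10F300u
#define BSS_INIT_CYCLES       175146u
#define DATA_INIT_CYCLES      322942u
#define WATCHDOG_LIMIT_CYCLES 1622015u

/* Number of whole 32-byte blocks in [start, end). */
static uint32_t block_count(uint32_t start, uint32_t end)
{
    return (end - start) / 32u;
}

/* Extra cycles if the watchdog were kicked on every iteration: 16(n + 1). */
static uint32_t per_iteration_kick_cycles(uint32_t n)
{
    return 16u * (n + 1u);
}

/* Nonzero if a loop stays under the limit even with per-iteration kicks. */
static int fits_watchdog(uint32_t loop_cycles, uint32_t n)
{
    return loop_cycles + per_iteration_kick_cycles(n) < WATCHDOG_LIMIT_CYCLES;
}
```

With `n = block_count(USER_RAM_START, USER_RAM_END)`, which is 1,944, both loops pass `fits_watchdog`, and even the two loops run back to back with no kick in between (175,146 + 322,942 = 498,088 cycles) stay well under the limit.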

If it's desired, I can make the .data initialization end slightly early and copy over the last few bytes one-by-one so it doesn't leave anything extra in memory. I still don't see the point in it myself, since the heap could still potentially contain garbage completely out of control of the user's program (leftover memory from a game in another slot, for instance), but it shouldn't cost that much additional time.

KScl commented 1 year ago

I went ahead and altered the loop in a747e29 so that there's no longer any chaff left on the heap at all, leaving memory clean at the end of the .data section. This results in the .data section initialization taking anywhere from 110 cycles fewer (the length of .data is a multiple of 32, the extra loop gets skipped) to 572 cycles more (31 extra bytes need to be written after the initial 32-byte block loop); a negligible amount of time in the end.
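In C terms, the shape of the reworked .data copy is roughly the following. This is a sketch of the general block-then-tail pattern only. `copy_blocks_then_tail` is a hypothetical name, `memcpy` of 32 bytes stands in for the `movem.l` pair, and the actual instruction sequence in a747e29 may differ.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy len bytes from src to dst: whole 32-byte blocks first, then the
 * remaining 0 to 31 bytes one at a time, so nothing past dst + len is
 * ever written.
 *
 * Returns the number of bytes handled by the byte-wise tail loop.
 */
static size_t copy_blocks_then_tail(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t blocks = len / 32;
    size_t tail = len % 32;

    /* Fast path: one 32-byte block per iteration. */
    for (size_t i = 0; i < blocks; i++) {
        memcpy(dst, src, 32);
        dst += 32;
        src += 32;
    }

    /* Slow path: skipped entirely when len is a multiple of 32. */
    for (size_t i = 0; i < tail; i++) {
        *dst++ = *src++;
    }

    return tail;
}
```

The tail loop is where the extra cost quoted above comes from: it is skipped when the length is a multiple of 32 and runs at most 31 times otherwise.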

dciabrin commented 1 year ago

LGTM, thank you for your thorough explanation regarding the watchdog, and for your contribution.