The DTCM stack problem - GitHub Issues

asiekierka commented 5 months ago

First, to explain "the DTCM stack problem", I'm going to need to use a simplified model of the ARM9 memory space.

Generally, on ARM9, the "main memory" space stretches from 0x2000000 to 0x2FFFFFF - sixteen megabytes on DSi, four megabytes mirrored four times on NDS. DTCM, a 16KB area of fast memory visible to the CPU, is placed between 64 and 48 KB from the end of this space, that is, at 0x2FF0000 to 0x2FF3FFF. It houses the following things:
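To make the address arithmetic above concrete, here is a minimal sketch of the simplified model. The constant and function names are illustrative and do not come from libnds or BlocksDS.

```python
# Simplified ARM9 address model, following the description above.
KB = 1024

MAIN_RAM_BASE = 0x2000000
MAIN_RAM_END = 0x3000000             # exclusive; the last byte is 0x2FFFFFF

DTCM_SIZE = 16 * KB
DTCM_BASE = MAIN_RAM_END - 64 * KB   # 64 KB from the end of the space
DTCM_END = DTCM_BASE + DTCM_SIZE     # exclusive; 48 KB from the end


def in_dtcm(addr: int) -> bool:
    """Return True if an ARM9 access to addr lands in DTCM, not main memory."""
    return DTCM_BASE <= addr < DTCM_END


print(hex(DTCM_BASE), hex(DTCM_END - 1))
```

The word just below DTCM_BASE is ordinary main memory, which is what lets an unbounded stack keep growing downward.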

- user variables in DTCM (DTCM_BSS etc.), if any,
- a small reserved area for the BIOS IRQ handler,
- the stack: supervisor, IRQ (both bounded), and finally the user stack.

Let's look at the memory situation when user variables are not present:

[Figure: the memory area around DTCM without any user variables in DTCM.]

The idea of overlaying DTCM on top of main memory is sound - while it does mean accessing that 16KB of memory becomes trickier, it allows the stack to grow past DTCM into slower (especially on NDS where it is uncached) main memory.

However, what happens if we introduce user variables to the mix?

[Figure: the memory area around DTCM with user variables in DTCM.]

Oh no! The stack is now bounded, and quite small at that - if we use 8KB of DTCM for user data, for example, that makes our stack limited to slightly over 7.5KB.
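The bound can be worked out with simple subtraction. The sketch below assumes a single `reserved` figure covering the BIOS IRQ area plus the supervisor and IRQ stacks. The thread does not state that size, so the 512 bytes used in the example is purely hypothetical.

```python
KB = 1024
DTCM_SIZE = 16 * KB


def bounded_user_stack(user_data: int, reserved: int) -> int:
    """Bytes left for the user stack when user data sits at the start of DTCM.

    reserved is a hypothetical total for the BIOS IRQ area plus the
    supervisor and IRQ stacks; the real value is not given in the thread.
    """
    left = DTCM_SIZE - user_data - reserved
    if left < 0:
        raise ValueError("user data and reserved areas do not fit in DTCM")
    return left


# 8 KB of user data with a hypothetical 512-byte reservation.
print(bounded_user_stack(8 * KB, 512) / KB, "KB")
```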

There are a few ways to solve, or work around, the DTCM stack problem:

1. Stop libnds from allocating variables in DTCM - currently, only the cothread mechanism places a variable there. This doesn't solve the problem at all, but it at least means programs which do not otherwise use DTCM will have a less bounded stack. This could be a good workaround for release 1.3.0, though.
2. Move DTCM outside of main memory. If the stack can't grow past it anyway, there's no reason for it to be there, and we can have an additional 16KB of heap memory available. Great! This likewise doesn't solve the problem, of course.
3. Place user variables at the end of DTCM, before the stack. This is the solution, but there's a catch.

The catch is that the GNU linker does not support placing a section at the end of a memory region, only at the beginning - and it provides no (reliable) way to figure out the section's size before allocating it. This solution would thus require a more complex approach to linking than just calling ld: the size of the DTCM section would need to be calculated first (perhaps with a "stub" linker script which only allocates DTCM variables), and only then could one do the final link. This, then, necessitates writing a linker wrapper that BlocksDS programs use to link.
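A linker wrapper along these lines could be sketched as follows. This is not the BlocksDS implementation, only an illustration under stated assumptions: the first pass is assumed to have produced an ELF whose DTCM section is inspected with GNU `size -A` (default decimal radix), the section name `.dtcm` is hypothetical, and passing the result through `--defsym` is an assumed mechanism. The helpers only parse text and build an argument list, they do not invoke any tool.

```python
# Sample `size -A` output from a hypothetical first-pass ("stub") link.
SAMPLE_SIZE_OUTPUT = """\
stub.elf  :
section     size       addr
.text      12288   33554432
.dtcm        256   50266112
Total      12544
"""


def section_size(size_output: str, section: str) -> int:
    """Extract a section's size in bytes from `size -A` output.

    Returns 0 when the section is absent (no DTCM variables at all).
    """
    for line in size_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == section:
            return int(fields[1])
    return 0


def final_link_args(linker_args, dtcm_size, symbol="__dtcm_data_size"):
    """Append a --defsym option carrying the measured size (assumed mechanism)."""
    return list(linker_args) + [f"--defsym={symbol}={dtcm_size}"]


# Pass 1 would run the linker with the stub script, then `size -A stub.elf`.
measured = section_size(SAMPLE_SIZE_OUTPUT, ".dtcm")
# Pass 2 is the real link, now told how large the DTCM data region is.
print(final_link_args(["-T", "final.ld", "main.o"], measured))
```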

As a side-note, while DTCM is being discussed - why not move it to 0x2FF4000? This space houses the devkitARM-standard bootstub, and is thus reserved and only used when the homebrew application is exiting - the DTCM could be relocated in such a situation to access the bootstub, while programs writing their own bootstubs could simply use an uncached pointer not covered up by DTCM. This would, then, unlock that additional 16 kilobytes of heap for user programs.

profi200 commented 5 months ago

The linker issue could be solved by moving the stack top pointer outside the linker script. I can understand why you want to manage everything with the linker script, but I think this is a reasonable workaround.

Another idea is to split DTCM into 2 memory regions in the linker script and assign the stack to the first region and bss to the second.

asiekierka commented 5 months ago

> The linker issue could be solved by moving the stack top pointer outside the linker script.

It wouldn't. You still can't place the actual user variables at the end of DTCM, because you don't know how big the user variable section is until after it has been allocated - unless you use a two-pass strategy.

> Another idea is to split DTCM into 2 memory regions in the linker script and assign the stack to the first region and bss to the second.

That is close to how I plan to implement it: the link script gains a symbol named something like dtcm_data_region_size, which defines the size of the data region.

- For updated Makefiles, which use a linker wrapper, the wrapper calculates the size of this region and forwards it to the proper linker.
- For legacy Makefiles, that symbol is set to 0 by default, which makes the link script mimic the old behaviour, with any user-provided DTCM BSS placed at the beginning of DTCM (when the value is > 0, the data region goes at the end instead).
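The two behaviours can be modelled as a single function of the size symbol. This is a simplified sketch, not the actual link script: it ignores the reserved BIOS IRQ area and the supervisor and IRQ stacks, and the 8-byte rounding is an assumption made to keep the stack top aligned.

```python
DTCM_BASE = 0x2FF0000
DTCM_SIZE = 16 * 1024
DTCM_END = DTCM_BASE + DTCM_SIZE  # exclusive


def split_dtcm(data_region_size: int):
    """Return (data_start, stack_top) for a given data region size.

    Simplified model: reserved areas at the top of DTCM are ignored.
    """
    if not 0 <= data_region_size <= DTCM_SIZE:
        raise ValueError("data region does not fit in DTCM")
    if data_region_size == 0:
        # Legacy behaviour: any DTCM data starts at the base and the stack
        # starts at the end, so data and stack can collide.
        return DTCM_BASE, DTCM_END
    size = (data_region_size + 7) & ~7   # round up to 8 bytes (assumption)
    data_start = DTCM_END - size
    # The stack starts right below the data region and grows downward,
    # eventually past DTCM_BASE into main memory.
    return data_start, data_start


print([hex(a) for a in split_dtcm(8 * 1024)])
```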

profi200 commented 5 months ago

Ah, I wrote that without giving it a second thought. If the data could somehow be allocated at runtime on the stack and this DTCM bss region were removed, it would solve the issue.

asiekierka commented 5 months ago

This would require at least one layer of indirection (or something like runtime relocation - scary!), and thus nullify a chunk of the benefit of using DTCM in the first place.

asiekierka commented 2 months ago

The first part of this - adding a configurable __dtcm_data_size link-time argument - is done in https://github.com/blocksds/sdk/pull/202

asiekierka commented 2 weeks ago

There's another problem here.

While everything is fine in TWL mode, in NTR mode the mirror at 0x2C00000..0x2FFFFFF that the extended stack would end up using is uncached, which isn't all that great.
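To see where a stack that overflows DTCM ends up in NTR mode, the mirroring can be modelled as below. The function names are illustrative, and the only cacheability claim taken from the thread is that the last mirror is uncached.

```python
MAIN_RAM_BASE = 0x2000000
MAIN_RAM_END = 0x3000000              # exclusive
NTR_RAM_SIZE = 4 * 1024 * 1024        # 4 MB, mirrored four times
LAST_MIRROR_BASE = MAIN_RAM_BASE + 3 * NTR_RAM_SIZE
DTCM_BASE = 0x2FF0000


def ntr_physical_offset(addr: int) -> int:
    """Offset into the 4 MB of NTR main RAM that addr aliases."""
    if not MAIN_RAM_BASE <= addr < MAIN_RAM_END:
        raise ValueError("address is outside the main memory space")
    return (addr - MAIN_RAM_BASE) % NTR_RAM_SIZE


def in_last_mirror(addr: int) -> bool:
    """True if addr falls in the 0x2C00000..0x2FFFFFF mirror."""
    return LAST_MIRROR_BASE <= addr < MAIN_RAM_END


# First word a downward-growing stack touches after leaving DTCM.
spill = DTCM_BASE - 4
print(hex(spill), in_last_mirror(spill), hex(ntr_physical_offset(spill)))
```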

blocksds / sdk

The DTCM stack problem #166