electro-smith / libDaisy

Hardware Library for the Daisy Audio Platform
https://www.electro-smith.com/daisy
MIT License

Need to know available SRAM and SDRAM at runtime #242

Open grrrwaaa opened 4 years ago

grrrwaaa commented 4 years ago

[EDITED]

In testing exports to Daisy I discovered 2 issues that call for a runtime method of knowing available space in SRAM and SDRAM.

1.) Using internal SRAM is significantly faster than using SDRAM. For example, a reverb patch requiring 240 kB of delay memory uses 42% of the available CPU time with its delays in SDRAM, but only 12% with them in SRAM.

2.) If memory allocations made with malloc() etc. (which currently go to SRAM) exhaust that memory, the Daisy appears to simply freeze.

For this reason it would be preferable to have a memory allocator that can use SRAM as much as possible (for speed), and defer to a block of SDRAM when this is exhausted (so that user code does not mysteriously fail).

I was able to hack something along these lines using a pre-allocated block of SDRAM (thanks to Stephen Hensley for the DSY_SDRAM_BSS macro tip), diverting memory allocations to pull from this block if more than a certain amount has already been allocated from SRAM. But, this hack is based on guesswork about how much SRAM is available, and isn't really scalable. It happens to work for gen~ export code and the examples I've tested so far, but it probably won't work for all examples and all platforms and will fail mysteriously.
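For illustration, here is a rough sketch of the kind of hack described above, assuming libDaisy's DSY_SDRAM_BSS macro; the threshold, pool size, and the my_alloc() hook name are hypothetical and not part of the gen~ export or libDaisy.

#include <cstdint>
#include <cstdlib>
#include "daisy_seed.h" // provides the DSY_SDRAM_BSS section macro

constexpr size_t kSramBudget    = 230 * 1024;       // guesswork threshold for SRAM use
constexpr size_t kSdramPoolSize = 16 * 1024 * 1024; // pre-allocated SDRAM block

static uint8_t DSY_SDRAM_BSS sdram_pool[kSdramPoolSize];
static size_t sram_used  = 0;
static size_t sdram_used = 0;

void* my_alloc(size_t bytes)
{
    if(sram_used + bytes <= kSramBudget)
    {
        sram_used += bytes;
        return malloc(bytes); // the regular heap lives in internal SRAM
    }
    if(sdram_used + bytes > kSdramPoolSize)
        return nullptr; // both pools exhausted
    void* p = &sdram_pool[sdram_used]; // divert to the SDRAM block
    sdram_used += bytes;               // (no alignment or free() handling here)
    return p;
}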

Knowing the available SRAM may not be feasible at compile time, since there may be runtime calls to malloc()/new in Daisy code or binding code; thus pre-assigning RAM sources may not be feasible for user-level code.

Stephen suggested a memory allocator, used across libDaisy and its projects, covering:

dsy_malloc(), dsy_free(), dsy_realloc(), dsy_malloc_size() 

And the use of these methods throughout codebases that use Daisy. These routines would allocate memory from internal SRAM while such memory is available, since it is significantly faster, then defer to the slower SDRAM thereafter, and return freed memory accordingly. Obviously some kind of optimized allocator scheme would be preferred (there are a few open-source, real-time-oriented allocators out there to consider).
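For reference, the proposed primitives might look something like the following; these signatures are guesses for illustration and do not exist in libDaisy.

#include <stddef.h>

#ifdef __cplusplus
extern "C" {
#endif

void*  dsy_malloc(size_t size);             // SRAM first, SDRAM once SRAM is exhausted
void   dsy_free(void* ptr);                 // return memory to whichever pool it came from
void*  dsy_realloc(void* ptr, size_t size); // may move an allocation between pools
size_t dsy_malloc_size(void* ptr);          // usable size of an existing allocation

#ifdef __cplusplus
}
#endif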

TheSlowGrowth commented 4 years ago

And the use of these methods throughout codebases that use Daisy.

No, please not.

This sort of "pseudo-intelligent" memory allocation is not a good idea. Users need to understand which data must go into which RAM and why. With such an automatic malloc you quickly get into situations where the order of operations (= order of memory allocations) has an impact on the performance. This is only asking for trouble.

In most cases, if you use dynamic memory allocation on a bare metal embedded device, you've picked the wrong design pattern. The use cases where dynamic memory allocation is actually needed are very rare.

grrrwaaa commented 4 years ago

Just to be clear, we're not talking about continuous dynamic allocation here (at least in my case, for gen~ export, there is no memory allocation after the audio starts). I'm just talking about the fact that calling malloc() will always go to SRAM, which, when full, will simply fail; instead of failing, I think it would be better to default to using SDRAM once SRAM is exhausted. Since SRAM seems to be 3-4x faster than SDRAM, we should privilege using it as much as possible. From private discussions with Stephen, the only way to know when this happens is to count memory allocations, which means applying this rule globally, as far as we can tell. If there's a different way to know how much internal SRAM remains available, that would obviate the need for global dsy_malloc-type solutions.

For a platform in which users can export projects from arbitrary tools (faust, pd, gen, etc.) there's going to be a desire for the system to allocate memory automatically for optimal speed without failing, and right now that's not possible.

TheSlowGrowth commented 4 years ago

there is no memory allocation after the audio starts

As I said, then you probably don't actually need dynamic allocation in the first place.

I quickly looked at some gen~-generated C++ and it seems you can allocate it statically, if you're willing to change a few lines of the generated code.

grrrwaaa commented 4 years ago

The question is knowing how much SRAM is available, so that allocations can go there as much as possible, since it is 3-4x faster than SDRAM. This value is not known at compile time, since there are also allocations happening within libDaisy etc. (some of which might occur after audio starts).

Will update title & original detail accordingly.

TheSlowGrowth commented 4 years ago

I understand what you are trying to achieve (at least I think). But I don't think it will work the way you expect it to. If you have a malloc that can decide if it allocates from internal or external RAM - that's great. But it won't help you with the gen~ code.

I'm not very familiar with gen~ but from what I've seen, it collects all its state data in a large struct (EDIT: this is not true) which it creates dynamically with operator new(). Assuming you have your "intelligent" malloc, chances are that your gen~ code is too large to fit into SRAM. As SRAM and SDRAM don't sit in adjacent memory regions, your malloc will have to allocate the whole block for your gen~ code in SDRAM.

Now how does dynamic allocation help you here? All it does for you is:

  1. You don't know at compile time if your large gen~ patch will fit into the RAM at all
  2. You don't know at compile time in which RAM the patch will be allocated (== you can't make any guesses about performance)

If, instead, you modified your gen~ create() function to use placement new() and provided a std::aligned_storage<> as a function argument, then you could place this std::aligned_storage<> in whatever RAM seems appropriate and see right at compile time if it will fit.
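A minimal sketch of that idea, assuming a hypothetical generated class GenPatch (the real gen~ state type and create() signature differ):

#include <new>
#include <type_traits>

struct GenPatch // stand-in for the generated state struct
{
    float state[60000]; // roughly 240 kB of tightly-accessed state
};

// Storage with the right size and alignment; it ends up in internal RAM by
// default, or in SDRAM if you add libDaisy's DSY_SDRAM_BSS to this definition.
static std::aligned_storage<sizeof(GenPatch), alignof(GenPatch)>::type patch_mem;

static GenPatch* patch = nullptr;

void init_patch()
{
    patch = new(&patch_mem) GenPatch(); // construct in the pre-placed storage
}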

EDIT: Too bad, it seems gen~ just can't be used without dynamic allocation. In this case, your best bet is to write your own allocator that "allocates" memory from a fixed size block that you place in whatever RAM you want. At least you can check at runtime that your block is large enough and - if not - blink a user led and abort.

grrrwaaa commented 4 years ago

Hi thanks for the detailed attention!

gen~ data isn't all stored in a large struct. The majority of parameter variables, filter state variables, latched variables, etc. are local to the struct, as they are likely to be referenced continuously and are thus prioritized in memory location, but any large items (which means only delay lines and sample buffers) are allocated separately, as they are more likely to be accessed randomly. There's almost no chance that the main struct will be too big for SRAM (I seem to be able to get away with about 240 kB, which I'm pretty confident is more than enough for even an extremely complex patch). The buffers and delays get allocated separately from this struct, at a later stage of initialization.

It's just a question of where these buffers and delays get allocated. Ideally, as many that can fit into SRAM as possible, for better performance, hence the question.

In this case, your best bet is to write your own allocator that "allocates" memory from a fixed size block that you place in whatever RAM you want. At least you can check at runtime that your block is large enough and - if not - blink a user led and abort.

That's exactly what I have working already for SDRAM (which has to be pre-allocated anyway). It's just that I'd rather use as much SRAM as I can before doing so, since it is 3-4x faster. Hence needing a way of knowing how much SRAM remains available.

(For reference, a Dattorro-style reverb can fit entirely in 230 kB of SRAM, but the Gigaverb is much larger, needing 1632 kB -- or 240 kB of SRAM and 1402 kB of SDRAM, which also works and saves a significant percentage of CPU in the audio callback compared to allocating all delays in SDRAM. But honestly, picking 230 kB as my threshold is just guesswork.)

TheSlowGrowth commented 4 years ago

gen~ data isn't all stored in a large struct. The majority of parameter variables, filter state variables, latched variables, etc. are local to the struct, as they are likely to be referenced continuously and are thus prioritized in memory location, but any large items (which means only delay lines and sample buffers) are allocated separately, as they are more likely to be accessed randomly.

Thanks for the clarification. In this case, you'll have to live with the dynamic allocation. Too bad.

It's just that I'd rather use as much SRAM as I can before doing so

No one stops you from having two preallocated blocks - one in SRAM, one in SDRAM - and using these in your custom allocator. You only need a single head index that contains the index of the first unused byte in your two blocks. For each allocation, you check if head + numBytes <= sram_size and if so, "allocate" from the SRAM block, otherwise from the SDRAM block. Remember the alignment requirements.
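A minimal sketch of that scheme, assuming libDaisy's DSY_SDRAM_BSS macro and hypothetical pool sizes; it is bump-only (no free()) and alignment must be a power of two:

#include <cstddef>
#include <cstdint>
#include "daisy_core.h" // DSY_SDRAM_BSS

constexpr size_t kSramPoolSize  = 240 * 1024;
constexpr size_t kSdramPoolSize = 16 * 1024 * 1024;

alignas(8) static uint8_t               sram_pool[kSramPoolSize];   // internal RAM
alignas(8) static uint8_t DSY_SDRAM_BSS sdram_pool[kSdramPoolSize]; // external SDRAM

static size_t head = 0; // index of the first unused byte across both pools

void* pool_alloc(size_t bytes, size_t align = 8)
{
    head = (head + align - 1) & ~(align - 1);
    if(head + bytes <= kSramPoolSize) // fits in the SRAM block?
    {
        void* p = &sram_pool[head];
        head += bytes;
        return p;
    }
    // Otherwise allocate from the SDRAM block (any SRAM tail that didn't fit is skipped).
    size_t sdram_head = (head > kSramPoolSize) ? head - kSramPoolSize : 0;
    sdram_head        = (sdram_head + align - 1) & ~(align - 1);
    if(sdram_head + bytes > kSdramPoolSize)
        return nullptr; // overflow: blink a user LED and abort
    void* p = &sdram_pool[sdram_head];
    head    = kSramPoolSize + sdram_head + bytes;
    return p;
}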

At this point, we're talking about how to supply memory to gen~. My original point is still valid: You don't actually need true dynamic allocation. Your custom allocator can be used for gen~ only, and it can enter a fault state (= blink user LED, etc.) if it overflows. You should keep the rest of the system as static as possible to avoid all the trouble that comes with dynamic allocation.

Maybe I should rephrase my main concerns: If we add an intelligent malloc() to libDaisy, we encourage avoidable dynamic allocation and cause more problems than we solve. Keeping malloc() intentionally dumb forces users to at least think about how the memory model should work. We can still add custom allocators if needed, but they should come with big warning signs.

sletz commented 4 years ago

Interesting issue that we will also have to think about with Faust.

In Faust, a DSP is basically a class where everything (controllers, delay lines...) is currently allocated at the same level. We may want to refine this a bit and use a two-level memory layout across SRAM and SDRAM, as explained by @grrrwaaa for gen~.

Since several projects have the same needs, it could be interesting to have libDaisy provide some generic memory primitives to allocate separately in SRAM and SDRAM.

grrrwaaa commented 4 years ago

Thanks for the clarification. In this case, you'll have to live with the dynamic allocation. Too bad.

I don't see it as bad :-) ... it's more flexible (and was designed to be, since gen~ targets desktop as well) -- we get to insert whatever allocation scheme we need by overriding the genlib_malloc calls that all the exported code goes through -- and more importantly, we can separate allocation of larger random-access chunks like buffers from tighter-loop persistent data, which I think will work very well for the Daisy platform.

If nothing else, perhaps this is what libDaisy might also do -- have a dsy_malloc which is initially #defined to malloc, but can be replaced. I notice that some of the libraries embedded in libDaisy have an equivalent (ff_malloc, usbd_malloc, etc.). That would allow platforms to inject allocators as needed, while perhaps allaying concerns about encouraging dynamic allocation.

Returning to my specific problem: in my Daisy work I already pre-allocate memory stores in SRAM and SDRAM and have a custom allocator draw from those -- it does work (and gets me significantly better performance than naively allocating all buffers from SDRAM), but my problem is that I don't know how much SRAM I can safely reserve for the user-level gen~ code, which makes me worried that this custom allocator may be liable to break unpredictably. I tried a quick test by attempting to malloc (and then free) blocks of 1 kB until malloc returns null -- at the start of main() it reports 511 kB available by this method. Unfortunately this test isn't viable in practice, as it also seems to crash the Daisy thereafter.
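For completeness, the probe looked roughly like this (diagnostics only; as noted, it destabilized the heap in practice):

#include <cstdlib>

size_t probe_free_heap_kb()
{
    static void* blocks[1024]; // up to 1 MB probed in 1 kB steps
    size_t count = 0;
    while(count < 1024)
    {
        void* p = malloc(1024);
        if(p == nullptr)
            break;
        blocks[count++] = p;
    }
    for(size_t i = 0; i < count; i++) // release everything that was grabbed
        free(blocks[i]);
    return count; // approximate free heap, in kB
}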

As @sletz notes, this need to know the available memory, allocate in SRAM when possible, and defer to SDRAM when needed probably isn't specific to gen~ and will likely be something other projects also want; it might therefore warrant being written once within libDaisy, where it can benefit from community input, rather than individually for each project.

TheSlowGrowth commented 4 years ago

have a dsy_malloc which is initially #defined to malloc, but can be replaced.

I would prefer weak linkage, but yes, that could be a solution.
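A minimal sketch of the weak-linkage approach, assuming GCC/arm-none-eabi; dsy_malloc is the proposed (not yet existing) name from this thread:

#include <cstdlib>

extern "C" {
// Default definition inside the library: weak, so any project can replace it.
__attribute__((weak)) void* dsy_malloc(size_t size)
{
    return malloc(size);
}
}

// A project overrides it simply by providing a strong definition, e.g.
// extern "C" void* dsy_malloc(size_t size) { return my_custom_alloc(size); }
// where my_custom_alloc is whatever allocator the project supplies.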

TheSlowGrowth commented 4 years ago

I have today tried to get C++ code generated from the SOUL language to run on the Daisy - and compare it to handwritten C++. Fun fact: I can't reproduce your results regarding the "3-4x" speed difference. I see more like 8% increase in CPU render time. At the same time, the code seems way slower than I would assume - the simple reverb I used is barely realtime capable. It all seems fishy. If you want to take a look, see this (includes the code that I used): https://github.com/soul-lang/SOUL/issues/33#issuecomment-670977536

Weird thing is: I've heard similar speed difference numbers stated by other people so I'm starting to think my Daisy is not running at the full speed for some reason. Maybe my libDaisy is tweaked to death? The link above contains my code, if you could run it on your Daisy and let us know the speed differences, that would be very interesting to see.

Stephenmccaul commented 4 years ago

SDRAM performance is extremely dependent on access patterns. If you access data more or less linearly and stay within the constraints of the DCACHE, it can be pretty fast. Any sort of random access pattern can be more than 10x slower than DTCM. Most reverb implementations are very bad for cache locality, as they access a large number of samples from a wide range of addresses to compute one output sample. It's possible, but pretty tricky, to re-order reverb delay accesses to be more linear.

We use an allocator that allows us to choose which of the 9 types of H7 RAM an allocation will occur from, and there is a "pick the fastest RAM the allocation will fit in" option as well (fastest with respect to CPU access). We do not support freeing of RAM (or use of any general-purpose heap allocation). We have templatized versions that support typical C++-style initialization (constructors) as well. The trickiest part is the interface to the linker, as you need to know that you are not allocating from an address the linker has already put something else in.

auto* mem = mem_h7::get();                          // allocator singleton
verb      = mem->alloc<verb_t>(H7_DTCM_D0);         // place explicitly in DTCM
delays[0] = mem->alloc_fast<float, delay_count>();  // "fastest RAM it fits in"
delays[1] = mem->alloc_fast<float, delay_count>();
...
andrewikenberry commented 3 years ago

Knowing available SRAM and SDRAM at runtime