devkitPro / wut

Let's try to make a Wii U Toolchain / SDK for creating rpx/rpl.
zlib License

Replace the CafeOS default heap with a custom one #221

Closed by GaryOderNichts 2 years ago

GaryOderNichts commented 2 years ago

Summary

Wut currently uses wutmalloc as a wrapper around the default CafeOS heap functions (MEMAllocFromDefaultHeap / MEMFreeToDefaultHeap). This default heap is slow when an application makes a large number of allocations, which causes noticeable slowdowns. Many retail games use a fast custom heap to avoid this issue. This draft uses the malloc implementation in newlib instead and replaces the default heap functions with wrappers around the newlib functions (see wutdefaultheap). It is currently marked as a draft since it is a fairly major change and there may be issues resulting from it that I haven't thought of.

RPX files

RPX files now implement and export a __preinit_user function, which will be called before any allocations are done to allow replacing the MEMAllocFromDefaultHeap / MEMFreeToDefaultHeap functions (see memdefaultheap.h).
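To illustrate the replacement idea, here is a minimal host-side sketch, not the code from this PR. It assumes the default-heap entry points behave like replaceable function pointers; the hook_* variables, the wrap_* functions and install_heap_wrappers are hypothetical stand-ins, and aligned_alloc stands in for an aligned allocation in the C library heap.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical hook types, modelled on a size / size+alignment / free trio. */
typedef void *(*default_alloc_fn)(uint32_t size);
typedef void *(*default_alloc_ex_fn)(uint32_t size, int32_t alignment);
typedef void (*default_free_fn)(void *block);

/* Hypothetical stand-ins for the replaceable default-heap hooks. */
static default_alloc_fn    hook_alloc;
static default_alloc_ex_fn hook_alloc_ex;
static default_free_fn     hook_free;

/* Plain allocation: forward to the C library heap. */
static void *wrap_alloc(uint32_t size)
{
    return malloc(size);
}

/* Aligned allocation: assumes the alignment is a power of two.
 * A negative alignment is treated like its absolute value here. */
static void *wrap_alloc_ex(uint32_t size, int32_t alignment)
{
    size_t align = (size_t)(alignment < 0 ? -(int64_t)alignment : alignment);
    if (align < sizeof(void *))
        align = sizeof(void *);
    /* aligned_alloc expects the size to be a multiple of the alignment. */
    size_t rounded = ((size_t)size + align - 1) / align * align;
    return aligned_alloc(align, rounded);
}

static void wrap_free(void *block)
{
    free(block);
}

/* Would run before any allocation is made, like a pre-init callback. */
static void install_heap_wrappers(void)
{
    hook_alloc    = wrap_alloc;
    hook_alloc_ex = wrap_alloc_ex;
    hook_free     = wrap_free;
}
```

After install_heap_wrappers() has run, any code that allocates through the hooks ends up in the C library heap without knowing about the swap.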

In the preinit call wut allocates all of the available space in the MEM2 heap for sbrk. It then initializes wutdefaultheap which will replace MEMAllocFromDefaultHeap / MEMAllocFromDefaultHeapEx / MEMFreeToDefaultHeap with wrappers around the newlib functions. This results in CafeOS functions allocating from the newlib heap instead.
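The "one big block for sbrk" step can be pictured with a small bump-pointer sketch. This is an illustration under assumptions, not wut's implementation: arena_init and arena_sbrk are hypothetical names, and a caller-supplied buffer stands in for the memory taken from the MEM2 heap.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Bounds of the pre-allocated region and the current break inside it. */
static uint8_t *arena_start;
static uint8_t *arena_end;
static uint8_t *arena_brk;

/* Hand the whole region to the sbrk emulation once, at start-up. */
static void arena_init(void *base, size_t size)
{
    arena_start = (uint8_t *)base;
    arena_end   = arena_start + size;
    arena_brk   = arena_start;
}

/* sbrk-style interface: move the break by incr bytes and return the
 * previous break, or (void *)-1 with errno = ENOMEM when the request
 * would leave the region. */
static void *arena_sbrk(ptrdiff_t incr)
{
    uint8_t *old = arena_brk;

    if (incr > arena_end - old || incr < arena_start - old) {
        errno = ENOMEM;
        return (void *)-1;
    }
    arena_brk = old + incr;
    return old;
}
```

A malloc implementation layered on top of such a function never needs to go back to the system heap, because every growth request is served from the region reserved up front.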

Overriding this behavior

The user can override this behavior by implementing their own __preinit_user function. This skips the sbrk and wutdefaultheap initialization. Calling __init_wut_malloc from that function links in the old wrapper around the default heap. See this code for an example.

RPL files

Since RPL files don't support __preinit_user (and shouldn't mess with the default heap), they simply use wutmalloc, so their allocations come from the heap that the RPX has set up.

Speed comparisons

To test the speed, I wrote a simple tool that performs many heap allocations of various sizes, frees them, and displays how long each phase took. This tool is probably not the best for accurate timing, but it should be enough to show the performance increase.
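A benchmark of this kind could look roughly like the sketch below. It is not the tool used for the numbers in this thread: N_BLOCKS, the size pattern, bench_result and bench_malloc are all made up for illustration, and it times with the standard clock() function rather than a console timer.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define N_BLOCKS 4096

/* Time spent in the allocation phase and in the free phase. */
typedef struct {
    double alloc_us;
    double free_us;
} bench_result;

/* Convert a clock() interval to microseconds. */
static double elapsed_us(clock_t start, clock_t end)
{
    return (double)(end - start) * 1e6 / CLOCKS_PER_SEC;
}

/* Allocate N_BLOCKS blocks of varying size, then free them all,
 * timing the two phases separately. */
static bench_result bench_malloc(void)
{
    static void *blocks[N_BLOCKS];
    bench_result r;

    clock_t t0 = clock();
    for (size_t i = 0; i < N_BLOCKS; i++)
        blocks[i] = malloc(16 + (i % 64) * 48);  /* 16 .. 3040 bytes */
    clock_t t1 = clock();
    for (size_t i = 0; i < N_BLOCKS; i++)
        free(blocks[i]);
    clock_t t2 = clock();

    r.alloc_us = elapsed_us(t0, t1);
    r.free_us  = elapsed_us(t1, t2);
    return r;
}
```

Running the same function once per heap implementation and comparing the two phases gives a breakdown like the ones reported in this thread; a memalign variant would follow the same pattern.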

Using the default CafeOS heap:

Function    Allocations    Free
malloc      51321 µs       25250 µs
memalign    44195 µs       29695 µs

Total time: 150461 µs

Using the custom newlib heap:

Function    Allocations    Free
malloc      6265 µs        3410 µs
memalign    15019 µs       3676 µs

Total time: 28370 µs

This is roughly a 5x improvement in total time (150461 µs / 28370 µs ≈ 5.3).

GaryOderNichts commented 2 years ago

This has been tested in several applications and it seems to work fine so far. Going to mark this as ready and waiting for a code review now.

GaryOderNichts commented 2 years ago

Just pushed a commit which uses a spin lock instead of a mutex, which improves speeds even more.
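For readers unfamiliar with the trade-off: a spin lock busy-waits instead of putting the thread to sleep, which tends to be cheaper when the critical section is as short as a single heap operation. The sketch below shows the general pattern with C11 atomics. It is not the code from the commit: atomic_flag stands in for the OS spin lock, locked_malloc and locked_free are hypothetical names, and the lock is deliberately simple (not recursive, no interrupt handling).

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* One global flag guarding the heap; clear means "unlocked". */
static atomic_flag heap_lock = ATOMIC_FLAG_INIT;

/* Busy-wait until the flag could be set by this thread. */
static void heap_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&heap_lock,
                                             memory_order_acquire)) {
        /* spin: another thread is inside the heap */
    }
}

/* Clear the flag so the next waiter can enter. */
static void heap_lock_release(void)
{
    atomic_flag_clear_explicit(&heap_lock, memory_order_release);
}

/* Hypothetical wrappers that serialize access to the allocator. */
static void *locked_malloc(size_t size)
{
    heap_lock_acquire();
    void *p = malloc(size);
    heap_lock_release();
    return p;
}

static void locked_free(void *p)
{
    heap_lock_acquire();
    free(p);
    heap_lock_release();
}
```

The host malloc is already thread-safe, so the lock is redundant here; it only demonstrates where the acquire and release calls sit around each heap operation.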

Using the newlib heap (with an OSUninterruptibleSpinLock):

Function    Allocations    Free
malloc      4066 µs        1994 µs
memalign    8536 µs        2296 µs

Total time: 16892 µs

This is now almost a 9x improvement in total time over the CafeOS heap (150461 µs / 16892 µs ≈ 8.9).