WebAssembly / design

WebAssembly Design Documents

http://webassembly.org

Apache License 2.0

11.39k stars 693 forks source link

Wasm needs a better memory management story #1397

Open juj opened 3 years ago

juj commented 3 years ago

Hi all,

after a video call with google last week, I was encouraged to raise a conversation here around issues we at Unity have with Wasm memory allocation.

The short summary is that currently Wasm has grave limitations that make many applications infeasible to be reliably deployed on mobile browsers. Here I stress the word reliably, since things may work on some devices for some % of users you deploy to, depending on how much memory your wasm page needs, but as your application's memory needs grow, the % of users you are able to deploy to can dramatically fall.

These issues already occur when the Wasm page uses only a fraction of total RAM of the device. (e.g. at 300MB-500MB)

These issues have been raised as browser issues, but the underlying theme is recognizing that the wasm spec is not robust enough for mobile deployment to customers.

These troubles stem from the following limitations:

No way to control in a guaranteed fashion when new memory commit vs address space reserve occurs.
No way to uncommit used memory pages.
No way to shrink the allocated Wasm Memory.
No virtual memory support (leading applications to either expect to always be able to grow, or have to implement memory defrag solutions)
If Memory is Shared, then application needs to know the Maximum memory size ahead of time, or gratuitously reserve all that it can.

So basically Wasm memory story is "you can only grab more memory, with no guarantee if the memory you got is a reserve or a commit".

These are not particularly newly recognized issues, the memory model has been the same since MVP, and we have been dealing these ever since early asm.js days, but now that applications are becoming more complex and developers' expectations on what types of applications they want to deploy on which devices is growing, and developers are actually aiming to ship to paying customers, where reliability needs to be near that 100%, we are seeing hard ceilings on this issue in the wild.

Note that listing the limitations above is not implying that fix would be for wasm spec to somehow add support to all of these, but to set the stage that these are the limitations that exist, since their contributed combination is what causes headache to developers.

The way that Wasm VM implementations seem to tackle these issues is to try to be smart/automatic under the hood about reserve vs commit behavior, and esp. around shared vs non-shared memory. However it is still the application developer's responsibility to concretely navigate the app in the low-memory landscape, and this leads to developers needing to "decipher" the VM's behavior patterns around commit vs reserve outside the spec. For an example of the vendor-specific suggestions that this leads to, see https://bugs.chromium.org/p/chromium/issues/detail?id=1175564#c7 .

On desktop, the Wasm spec memory issues have so far fallen in the "awkward" category at most, because i) all OSes and browsers have completed migration to 64-bit already, ii) desktops can afford large 16GB+ RAM sizes (and RAM sizes are expandable on many desktops), and iii) desktops have large disk sizes for the OS to swap pages out to, so even large numbers of committed pages may not be the end of the world (just "awkward") esp. if they go unused for most parts.

On mobile, none of that is true.

Note that wasm memory64 proposal does not relate or solve to this problem. That proposal is about letting applications to use more than 4GB of memory, but this issue is about Wasm applications not being able to safely manage much smaller amounts of memory on mobile devices. (the opposite is probably true, attempting to deploy wasm64 on mobile devices would cause even more issues)

Currently allocating more than ~300MB of memory is not reliable on Chrome on Android without resorting to Chrome-specific workarounds, nor in Safari on iOS. As per the suggestions in the Chromium thread, applications should either know up front at compile time how much memory they will need, or gratuitously reserve everything that they can. Neither of these suggestions is viable.

Why Wasm requires developers to know the needed memory size at compile time

The Wasm spec says that one can conveniently set initial memory size to what they need to launch, and then grow more when the situation demands it. Setting maximum is optional, to allow for unbounded growth. On paper this suggests that developers might not need to know how much they need at compile time.

Reality is quite different, for the following reasons:

in the wild we have reports that memory allocation success rate can be better when initially allocate K MB, versus if you first allocate less, and later try to grow to K MB. The conversation in https://bugs.chromium.org/p/chromium/issues/detail?id=1175564#c7 also suggests that.
if shared memory is used, one does need to know an upper bound for the maximum memory usage.
since an application will need to account for the largest memory usage it may need (or it will fail at some point of its lifetime), practically initial == maximum memory.
one cannot set a gratuitous upper bound, since that can fail the allocation,
one cannot probe the largest upper bound that works in practice, since that can suffocate the browser or other JS allocations to fail.

In practice, especially on memory constrained devices, the current spec necessitates developers to somehow "just know" how much memory will be needed.

Why expecting developers to set memory size at compile time is not feasible

With respect to memory usage patterns, there are generally three types of apps/app workloads:

1) app workloads that use an unknown amounts of memory (AutoCAD/OpenOffice/etc document editors with "bring your own workload") 2) app workloads that use varying amounts of memory ("game menu needs 100MB, game level 1 800MB, game level 2 400MB, etc.") 3) app workloads that need a known constant amount of memory,

App developers cannot know the wasm memory size of apps of first type. To enable everyone's work size, they must generally reserve everything they can, and this has problems:

if one sets a huge 4GB max memory size, the VM may not allow that allocation, failing the app from starting even for users that would have only needed 1GB,
if one probes the largest max memory size that VM will accept, such probing can cause the browser to kill the page immediately, because it thinks it is using too much memory, or if it succeeds, it can cause the app to fail later on some other JS memory allocation since the wasm allocation took up all the available memory for the web page. (or it can cause the browser itself to fail, as we saw with Chrome GPU thread on Epic UE4 Zen Garden)
even if an application does find the suitable size to avoid the above issues, after the user unloads their document, the web page is unable to release that memory back to the system. On desktop the thinking may be that it does not matter, but on mobile it is critical to be able to release unused memory back, or the OS will be more eager to kill you when task switching.

App developers of type 2) share much of the above problems that apps of type 1) have, but one might argue they should be expected to be able to find the max needed size throughout their app lifetime and allocate that, but finding that limit can be hard work, and you may not be able to do it with 100% certainty.

Or developers of apps of type 3) might certainly be expected to choose the right needed amount and be happy with it. Initially it sounds like developers who have an app of type 3 can profile their apps to come up with a suitable initial memory size and never grow. However this has issues:

sometimes you don't know if your app certainly is of type 3). Hence you might allocate an initial K MB, but choose a maximum of K+delta MB to account for unexpected growth. This can cause failures to your app when you do need to grow, since the mobile device might fail the growth. (but it might have succeeded had you chosen initial:K+delta in the first place). Same goes for apps of type 2)
because profiling memory usage can be hard, or it may be something developers don't know how to do, application developers may choose to just allocate everything they can to "remove a problem" without being aware of the consequences. We routinely see this in practice, where e.g. on itch.io you can see simple 2D games that run with a 1.5 Gig Wasm heap of which most is unused. There is uncertainty if that is wasted committed memory, or just reservation, because the spec gives no guarantees. Then they complain that web browsers/wasm is crap when their game doesn't work on mobile.

Android app switching is a major Wasm usability pain

The documentation at https://developer.android.com/topic/performance/memory-overview at the very bottom of the page states:

Note: The less memory your app consumes while in the cache, the better
its chances are not to be killed and to be able to quickly resume.

It is a common game development QA test to perform "fast app switching" testing, which can kill game UX and player interest if it does not work. For example if a user is playing a game, then gets a WhatsApp message, they will quickly switch over to WhatsApp, type in a message, and then switch back in to the game, and expect the game to still be running. Or switch over to email, or Instagram, or whatever you have, and come back a few minutes later.

The less memory your application is consuming, the better chances you have that the page will not need to reload. With native applications this prompts the developer to push their memory usage down as much as possible when they are switched out. Mobile devices do not swap memory back to disk (at least like desktops do), but they will kill background apps if they run out of memory.

For wasm apps running in a browser, this means that for an app that has extra gig in their Wasm heap going unused because they cannot release it back to the OS, the browser will become a prime target for being killed out, and when you task switch back to the app page, the page will reload from scratch, killing fast switching.

Safari even kills you on the foreground if you allocate too much - but you have no way of knowing how much that too much is.

Some applications need address space, not memory

Native compiled wasm applications behave very similar to native applications. It is often a need for a native application to reserve a lot of address space in order to get access to a chunk of linearly consecutive memory (when existing memory allocations cannot find a linear block). Wasm applications sometimes need that too. Currently the only way to do that is to .grow() by a large amount. This means that whatever smaller bits of fragmented memory a wasm app has, can go unused, but still be committed in memory. This causes wasm apps to use more committed memory than their native counterparts.

The amount of this overhead depends on the amount of fragmentation that the wasm app causes. Most native applications have not needed to care about this for ages, but for wasm, this can be all of a sudden a huge issue. Note that memory64 proposal again does not resolve this, because it does not bring virtual memory to wasm - just changes the ISA to accept 64-bit addresses (to my best knowledge)

Summarising the problems

Reiterating, the main problems that we currently see:

wasm spec expects developers to need to know the required memory size, which is not feasible for the reasons described above,
wasm apps may need to run with large overallocated memories, leading to browser failures, JS alloc failures, or if lucky, "just" to Android app switching UX problems,
wasm apps consume more memory than native counterparts, because of memory fragmentation, lack of virtual memory, and lack of unmapping memory pages

What can be done about the problem?

In a recent video call with ARM, we discussed the (lack of) adoption of Unity3D on Wasm on ARM mobile devices, and the short summary is that these memory issues are a hard wall for feasibility of Unity3D on Wasm on Android. There have been existing conversations in #1396 and #1300 about how to shrink memory, but no concrete progress.

On the concrete bugs front, if Chrome eventually migrates to 64-bit process on Android, it can help larger than 300MB Wasm applications to work on chrome. (However an issue here may be is that manufacturers are still releasing 32-bit only Android hardware in 2020, because of old inventory stock or what - we have no idea) If Safari fixes their eager page kill behavior, maybe it will help developers gauge the max limits on iPhones. But those will not help the problem that a committed memory page is still a committed memory page, and a mobile device does have to carry it around somewhere.

Besides that, here are some ideas:

Would it be possible to make the commit vs reserve behavior explicit for Wasm? Maybe as a browser coordinated extension if not for the core spec? This would give guarantees to application developers as to what the best practices initial vs maximum vs grow semantics should be. The current situation where one browser vendor recommends to probe the max amount of memory that can be reserved, vs another browser vendor expecting that apps allocate only the minimum needed amount or be killed if they exceed that, strongly suggests that the spec is missing something to connect the expectations together.
Would it be possible to add support for unmapping memory pages from Wasm? Then e.g. Emscripten could implement unmapping of memory pages into its dlmalloc() and emmalloc() implementations, fixing memory commit issues, and the related Safari "high memory consumption" process killing, and Android task switch killing troubles?
Would it be possible to somehow make a softer version of WebAssembly.Memory maximum field? If an app allocates Memory with maximum=4gb, which risks the rest of the browser/JS losing its address space (in 32-bit contexts), then maybe the browser could start reclaiming the highest parts of that reserved address space for its own purposes if the wasm app hasn't .grow()n that memory into its own use yet?

Then if one allocated a Memory with maximum probed to as much as it can go, but then allocated a large regular ArrayBuffer, maybe the browser could just steal some of that maximum back, if the Wasm app hasn't .grow()n into it? Likewise, if there was a .shrink() operation that an app could make use of, then maybe paired with this kind of address space stealing logic, the Wasm app and the rest of the browser could coordinate to "trade" address space, depending on how much of it was actually committed in the wasm heap, vs not actually used.

I hope the impressions here will not be a "this should be left to implementation details", since when I raised these concerns as a browser implementation bug, the message was that maybe the wasm spec should address this. And currently browsers are certainly not providing common enough implementations to enable developers to succeed with Wasm on mobile devices.

Thanks if you read all the way to the end on the long post!

conrad-watt commented 3 years ago

Thanks @juj, this is a great write-up! I just wanted to add a supplementary comment, but I hope someone else can chime in with a more holistic perspective (I did read the whole thing, I'm just not qualified to respond to most of it):

Would it be possible to somehow make a softer version of WebAssembly.Memory maximum field? If an app allocates Memory with maximum=4gb, which risks the rest of the browser/JS losing its address space (in 32-bit contexts), then maybe the browser could start reclaiming the highest parts of that reserved address space for its own purposes if the wasm app hasn't .grow()n that memory into its own use yet?

IIUC, this is already permitted by the specification, since even when setting a maximum size it is permitted for memory.grow to start failing arbitrarily at a smaller size. This may tie into your point that even though the specification allows certain mitigations for memory problems in theory, browser divergences limit what applications can rely on (edit: and therefore we may need more spec guidance). I appreciate that even if every browser performed this mitigation on mobile, it wouldn't necessarily solve all the problems you bring up.

pipcet commented 3 years ago

This is a really interesting read, thank you. I may not be particularly qualified to comment on this, but my outsider's perspective is that wasm as it stands today assumes, and prohibits deviations from, a simulated physical memory model.

It should continue requiring only such a model, but allow for "full" virtual memory capabilities (with the possible exception of such pains as mapping thread-local storage into the shared address space).

This should happen in the wasm spec, rather than simply stating that all memory issues are implementation-dependent. That is because while the virtual memory model does offer near-endless possibilities, most of them can be accessed through standardized and extensible interfaces which would not be beyond the scope of such a specification. We're talking about a small number of POSIX system calls, and having reasonable fallbacks for them (such as copying rather than remapping memory).

In other words, I think this is a case where the benefits of going for a general solution outweigh the burden of implementing a few ENOSYS wrappers on low-end implementations. The initial model was way too limited, and replacing it by one that's still quite limited seems like a bad idea to me.

KronicDeth commented 3 years ago

Not having access to virtual memory and memory being committed vs reserved is one of the reasons why for the WASM target Lumen (our AoT, single-binary Erlang runtime/compile) needs to have a different memory allocator than one closer to how the BEAM VM for Erlang does memory management. @bitwalker can go into more details of the changes.

lukewagner commented 3 years ago

Hi @juj! There's a lot to address in your comment, but just to focus on the subsection "Why Wasm requires developers to know the needed memory size at compile time", w.r.t this bullet:

one cannot set a gratuitous upper bound, since that can fail the allocation,

Maybe I'm misunderstanding the problem or the current implementation strategies in Chrome/Safari, but the intention of having a separate initial and maximum is that the engine only fails when it isn't able to allocate initial; it should never fail trying to allocate more than initial. For maximum, the engine is encouraged to make a best-effort attempt to reserve some amount of memory between initial and maximum. For example, in Firefox, if reserving maximum fails, Firefox tries iteratively smaller allocations, down to initial. Assuming this implementation, it seems like Unity could set initial to some super-low value (below which it would be impossible to run in any case) and set maximum unconditionally to some gratuitously-high value.

Would that address this part of the problem? If so, perhaps we could ask the Chrome/Safari engineers if this matches their current implementation.

conrad-watt commented 3 years ago

@lukewagner one aspect of the problem mentioned in that same subsection is that, at least on V8, that approach leads to memory.grow failing more often. The OP links this bug report (https://bugs.chromium.org/p/chromium/issues/detail?id=1175564#c7) where it's stated that V8 on 32-bit platforms only allocates exactly the initial memory size and performs subsequent grows using realloc.

This ties into the point made in idea (1) towards the end of the post, that the optimal strategy for picking initial is currently different depending on the browser.

EDIT: if the Firefox implementation is aggressive in reserving as much memory/address space as it can, does it ever try to release any if it's not grown into after some amount of time (in line with my comment)? One other aspect of the OP is that Wasm programs making large reservations can cause problems for mobile devices.

kmiller68 commented 3 years ago

Would that address this part of the problem? If so, perhaps we could ask the Chrome/Safari engineers if this matches their current implementation.

JavaScriptCore only reserves the requested initial. That said, currently JSC's wasm only ships on 64-bit so VA space is less of an issue. Although, we do put WASM into a large "caged" VA space so they could out of VA there but they're much more likely to get killed by the OS before that. If we ever shipped on 32-bit we would certainly have the same issue as V8.

EDIT: if the Firefox implementation is aggressive in reserving as much memory/address space as it can, does it ever try to release any if it's not grown into after some amount of time (in line with my comment)? One other aspect of the OP is that Wasm programs making large reservations can cause problems for mobile devices.

My assumption is that FF is mprotecting with PROT_NONE, which only dirties page table data in the OS. I'm not sure what they do for 32-bit, though.

penzn commented 3 years ago

On the surface, it looks like the biggest pain point is inability to release memory. #1396 describes a workaround - reinstantiate while preserving compiled module, but that requires a high degree of compartmentalization and might not be feasible for some apps.

Shrinking memory within existing model isn't trivial. While we can grow memory by adding more pages at the end of the address space, if we do the same for shrinking it we would require defragmentation (to ensure those are actually empty), which means that a simple free won't be able to release memory. On the other hand, dropping pages from the middle would break linear indexing.

I am not sure using memory buffer for anything else would open the door for security vulnerabilities (probably not in an obvious way, but probably would require a bit of hardening), but more importantly any solution would require new instructions. I think we need a memory buffer management tied to primitives accessible from host memory management routines, which is probably close to approach 2.

There is a multi-memory proposal, maybe it would be possible to map large allocations to new memories which would get GC'd once unreferenced.

conrad-watt commented 3 years ago

I think we need a memory buffer management tied to primitives accessible from host memory management routines, which is probably close to approach 2.

How close is this to adding a GC'd reference type representing a first-class byte buffer, with operations analogous to ArrayBuffer?

Related, there is a JS proposal for a ResizableArrayBuffer, which if implemented successfully could have implications for the viability of shrink in Wasm, at least for non-shared memories/hypothetical first-class byte buffers.

lukewagner commented 3 years ago

@conrad-watt Oops, I had missed that comment, sorry.

Just to give a bit more historical background: half the motivation for adding maximum was specifically to address these tensions Jukka explains w.r.t choosing the right intial size by encouraging the reservation scheme I mentioned above. Firefox used to experience asm.js OOMs acutely on Win32, motivating maximum, and, with the maximum-reservation impl techniques (especially when combined with a fresh process), we had a significant drop in Win32 OOMs, confirmed by partner telemetry.

EDIT: if the Firefox implementation is aggressive in reserving as much memory/address space as it can, does it ever try to release any if it's not grown into after some amount of time (in line with my comment)? One other aspect of the OP is that Wasm programs making large reservations can cause problems for mobile devices.

FF clamps the max reservation size to 1gb which, in practice, seems to leave enough room for the other allocations, although I could imagine also choosing a somewhat lower clamp. It's hard to design a heuristic that knows when you've seen the "last" memory.grow, so attempting to give back the reservation could cause unnecessary late OOMs. But maybe a compromise could be to hook into the system's low-memory notification and at that point release unused virtual address space?

lukewagner commented 3 years ago

On the separate topic of shrinking: do people actually want a memory.shrink (which, given a normal fragmented malloc heap, will rarely be possible to do for any significant amount -- it seems like you'd need a custom global memory management scheme to shrink with confidence) or just some way to achieve an madvise(MADV_DONTNEED) call (which is already called by some malloc impls, like jemalloc, and in general can be more-easily adopted in an ad hoc manner).

juj commented 3 years ago

To concretely help gauge the differences in browsers on this behavior, I wrote a mobile friendly interactive memory allocation test page, available at http://clb.confined.space/wasm_grow.html (self contained HTML you can download, or run live)

Here is what I see:

Huawei P10 Plus (6GB of RAM) + Android 8.0.0 + Chrome 88.0.4324.152

new WebAssembly.Memory({ initial: 1 }); followed by Wasm grows, followed by JS allocations:

heap can be grown up to 512MB. [Chromium 1175564]❌ After that one can still allocate 767MB more of (noncontiguous) JS memory (max chunk size of 256MB) ✔️. Both Wasm and JS grow attempts later fail gracefully as JS exceptions ✔️.

Same, but specify {maximum:32767} for a gratuitous maximum reservation(?) or allocation(?):

identical result (no help from {maximum:32767}) ❌

Same as before, but also specify shared: true:

identical result (no help from shared: true either) ❌

new WebAssembly.Memory({ initial: ? });: probe maximum allocatable Wasm size, followed by JS allocations:

a considerably larger 975.938 MB heap is allocated. ✔️ If this was a reserve, then it might be fine, but based on earlier comments the impression is that this is a commit(?). On top of this, further 1.3GB of (noncontiguous) JS memory can be allocated. ✔️ Concerned what will happen with Chrome on "old browser address space" scenario under this allocation scheme.

new WebAssembly.Memory({ initial: 1, maximum: 900*1024/65536 }) to try to improve on 4) above, by reserving only up to the known max size that it was able to reach (and not a gratuitous maximum):

Wasm heap still caps out at 512MB. ❌

512MB of JS allocations, followed by new WebAssembly.Memory({ initial: 1 });, followed by Wasm grows:

(this test tries to simulate a scenario where the browser may have been open and running for a while ("old address space"), and a scenario where the page might allocate some JS content before the first wasm allocation) This is where varying results kick in: after having first allocated JS memory, sometimes Wasm heap caps out at 320MB, other times at 384MB, etc. ❌

Fast app switching test: allocate max wasm heap and 256MB of JS, and 900MB of Wasm heap, then task switch out to some other browser tabs or apps (Instagram, email) for a short period, and come back.

Browser is evicted rather immediately, causing page reload when navigating back, losing the app state. ❌

Huawei P10 Plus (6GB of RAM) + Android 8.0.0 + Firefox 85.1.3

from above:
- heap can be grown up to 2GB-64K, ✔️ after which about 1GB more of JS memory can be allocated ✔️, before browser silently reloading the page (no OOM JS exception) [Bugzilla 1693256] ❌
and 3. from above: 2GB-64K alloc ✔️
from above: 2GB-64K alloc. ✔️
N/A
Can allocate 2GB of JS memory ✔️, after which a 1GB Wasm heap allocation still succeeds ✔️. Attempting to grow wasm heap past that to 2GB will cause silent page reload with no JS exception. ❌
Eviction happens, but subjectively maybe not as fast as with Chrome. ❌

iPhone Xs + iOS Safari 13.3.1

(apologies for not testing on a newer iOS Safari, but iOS update is not working to update to newer version on this phone, and I do not have any other one to test with. I hope this data is still relevant)

new WebAssembly.Memory({ initial: 1 }); followed by Wasm grows, followed by JS allocations:

on first test, heap could be grown up to 768MB. After that one can still allocate ~700MB of JS memory, after which the browser will silently reload the page. [WebKit 221530] ❌
repeating the test by reloading the page, heap could be grown only to 544MB. After that allocating ~100MB of JS memory resulted in browser reload of the page. ❌
repeating the test a third time, heap could be grown to 512MB, and attempting to grow wasm heap even more would cause the browser reload the page. ❌

Specifying maximum: 1GB does not help, page still reloads at ~512MB-768MB ❌
No help from shared: true either. ❌
Probing initial enables a whopping 1.8593GB heap to be acquired! ✔️ But after that, allocating even the tiniest 64KB JS ArrayBuffer will cause the page to immediately reload. ❌
No help from specifying a more modest maximum either. Page reloads at 512MB. ❌
Able to allocate 512MB of JS memory, and after that the same 512MB of Wasm heap. Allocating more JS memory will cause a page reload. ❌
Eviction was noticeably harder to reproduce, but did occur after launching a bit more memory consuming apps. ✔️/❌

Summary

The aforementioned issues pop up in different forms in the tests:

not being able to get enough initial memory,
needing to overallocate initial memory to do better than .grow(),
not getting JS exceptions when memory grow fails, but page reloads/OOMs,
not being able to release memory to keep Fast App Switching alive

Being in danger of suffocating browser native address space issues would not show up in this test, mainly because this test does not call out to any memory intensive browser APIs (XHR/Fetch/WebGL/WebAudio) that might risk exhausting memory. It is hard to say how prevalent such issues are on 32-bit Chrome. Firefox had an excellent memory allocation success in this test.

Testing some of this behavior is extremely fuzzy, for two main reasons:

it is hard to capture "old browser address space" behavior in test conditions, since it almost requires one to daily drive the device and browser for a while, and then do the testing.
it is hard to come up with a solid Fast App Switching eviction test, because what people do in the wild vs "lab conditions" can be quite different.

juj commented 3 years ago

Would it be possible to somehow make a softer version of WebAssembly.Memory maximum field?

IIUC, this is already permitted by the specification, since even when setting a maximum size it is permitted for memory.grow to start failing arbitrarily at a smaller size.

one cannot set a gratuitous upper bound, since that can fail the allocation,

Maybe I'm misunderstanding the problem or the current implementation strategies in Chrome/Safari, but the intention of having a separate initial and maximum is that the engine only fails when it isn't able to allocate initial; it should never fail trying to allocate more than initial.

Just to give a bit more historical background: half the motivation for adding maximum was specifically to address these tensions Jukka explains w.r.t choosing the right intial size by encouraging the reservation scheme I mentioned above. Firefox used to experience asm.js OOMs acutely on Win32, motivating maximum, and, with the maximum-reservation impl techniques (especially when combined with a fresh process), we had a significant drop in Win32 OOMs, confirmed by partner telemetry.

Hi Luke! I recall this thread of conversation well, as I was also working with that partner collaboration. It did indeed help 32-bit Firefox to a great extent based on their telemetry. In the test scheme above, Firefox on Android performs well, and is able to allocate large heaps. (not sure if it is a 64-bit process already on Android?)

Though in the test scheme above, it looks like no browser performed any different when gratuitous maximum: 32767 passed.

Even with that recollection, this current behavior we have been seeing with maximum did still get me confused to think the implementations were attempting to guarantee satisfying the maximum, and hence failing - but the test scheme above shows that is not the case - they are failing already when that is omitted.

Although, we do put WASM into a large "caged" VA space so they could out of VA there but they're much more likely to get killed by the OS before that.

We have received some odd behavior (maybe due to this?) in Safari where people report that when they have "old browser process" (long running process/lots of tabs open?), they may fail to launch Unity pages due to OOMs or page reloads, but killing Safari process and reopening it will help a Unity game to launch again. It has been very difficult to raise a bug report about this since producing an "old browser process" in QA is quite a fuzzy and nonrepeatable procedure. (in fact, we do see get similar reports also occassionally in Firefox and Chrome, but not quite as often as with Safari)

Although now in the above test, this "shrinking" of available memory was reproduced, i.e. first page load got 768MB of Wasm heap, first page reload 544MB, and second page reload was down to 512MB. Opened [WebKit 222097] about this.

On the surface, it looks like the biggest pain point is inability to release memory.

I tend to agree, since if there is a way to release memory, then it would probably fall out of that that the initial commit vs reserve semantics would need to become well defined across implementations. One could then release all the memory that was initially committed (if it happened to cause a commit).

The memory allocation problems with test results that I have in the above post could presumably be dealt with implementation specific bugs (Chrome being 32-bit, not getting graceful JS OOM throws on large alloc failures, etc).

Shrinking memory within existing model isn't trivial. While we can grow memory by adding more pages at the end of the address space, if we do the same for shrinking it we would require defragmentation (to ensure those are actually empty), which means that a simple free won't be able to release memory.

On the separate topic of shrinking: do people actually want a memory.shrink (which, given a normal fragmented malloc heap, will rarely be possible to do for any significant amount -- it seems like you'd need a custom global memory management scheme to shrink with confidence)

On its own, a .shrink() would not be enough. Indeed it would be an opportunistic behavior where an emmalloc/dlmalloc impl could only .shrink() when the freed allocations occurred at the top of the heap, which may not be the case for many applications (and needs the programmer to be memory fragmentation aware). Although in some apps, this could "trivially" be the case when they do large transitions in application lifetime (user closes edited document, player exits a game level back to main menu), where user navigation flow has been able to guarantee this kind of stacking allocation order.

The intent with .shrink() was that maybe it could help give some address space back to a 32-bit browser, i.e. let wasm apps run at all times with the smallest heap size that they need to fit all their own memory into. Then the browser would know also at runtime how much of that gratuitously reserved maximum: 32767 address space it could reclaim, if JS side or other browser operation would cause a large JS/native allocation. I.e. to assist in not suffocating the browser's own address space. (Though maybe browsers may have hard time actually taking benefit of such opportunity in practice?)

One pragmatic thing that such a .shrink() would certainly help if nothing else, are the large number of bug reports people produce about Wasm apps consuming large amounts of memory, or having a memory leak when they enter and exit a full game scene in Unity. What people are doing is they look at their Chrome/Firefox/Safari DevTools Memory tab, and see the effects of .grow() when they enter a scene, but when they exit back and the scene is unloaded and memory cleared, they can not observe any shrink in the wasm heap in DevTools, leading them to think that a memory leak must have occurred. In other words, browser DevTooling is unable to account for the actually used memory in Wasm.

I wonder what would happen on desktop when wasm64 becomes a thing. If a wasm64 app performs a huge/maximum address space reservation, could such operation cause a 64-bit browser to be address space constrained on the native side? Could a wasm64 app be desired to be able to .shrink() the address space back to the browser? Or maybe wasm64 will still not allow an app to reserve the full 64-bit address space, but a much more modest fraction of it, so that browser still will have plenty for its own.

Btw, after seeing https://github.com/bytecodealliance/wasm-micro-runtime earlier my first thought was to wonder how they deal with the lack of .shrink() in extremely memory constrained systems that may not have a concept of virtual address space at all(?).

or just some way to achieve an madvise(MADV_DONTNEED) call (which is already called by some malloc impls, like jemalloc, and in general can be more-easily adopted in an ad hoc manner).

This would certainly be the main remedy that I can think. Instead of .shrink(), that would allow all apps to benefit, and help the Fast App Switching problems.

Orthogonally to all of this, even already today without any spec changes, I wonder if the current browser DevTools implementations could be improved to detect and display how much of the wasm heap is actually committed vs just reserved? Currently all browsers will show a huge opaque block of Memory for the Wasm allocation. It would be nice to have DevTools display a "committed size, reserved size, % committed" type of visuals, where one could then see how much memory their application is impacting in practice.

What this would help is that developers would better understand the behavior they are getting when they are doing browser specific workarounds to WebAssembly.Memory() allocation patterns. Also when writing Emscripten's emmalloc I have wondered whether the memory region marking strategy can cause excess page commits for unused memory pages that applications may not ever be using, so would be great to see how that behaves in practice.

aardappel commented 3 years ago

Some applications need address space, not memory

Somewhat related: discussion on "Support for reserving address space" in Memory64: https://github.com/WebAssembly/memory64/issues/4

I generally would be very much in support of adding features related to mmap / reservation / shrinking / probing etc. to Wasm. Besides needing them for memory constrained devices, we will also need these for the opposite: programs wishing to manage large amounts of address space.

lukewagner commented 3 years ago

@juj It seems like, if browsers did implement the FF maximum-reservation scheme, then with a 2gb maximum specified, there should be no difference in your experiment between (1) the memory successfully allocated by new WA.Memory({initial:X}) probing and (2) the size to which you can eventually memory.grow. And (2) would have the added benefit of only being reserved memory.

Jukka, would a good deal of your needs be addressed if:

all browsers implemented the maximum-reservation scheme
there was a discard instruction (as mentioned in future features and briefly entertained as an MVP instruction)

penzn commented 3 years ago

How close is this to adding a GC'd reference type representing a first-class byte buffer, with operations analogous to ArrayBuffer?

@conrad-watt ideally very close, I just wasn't sure how this would work in the existing memory model. Sorry, I have not been following GC proposal close enough, can an object like this be accessed as part of linear memory?

conrad-watt commented 3 years ago

The "simplest" version of (my interpretation of) this idea would be to make such buffers like any other GC object. That is, each buffer would have an entirely disjoint address space from any other (enforced by bounds checking), they'd have their own family of load/store operations, and would be stored (by reference) in a table, or as a field of another GC object, rather than in linear memory.

I wasn't sure if this was what you had in mind, or if the idea was to tie more closely to the existing linear memory (by having a host procedure to manage chunks of linear memory that are still manually accessible through regular load/store?).

juj commented 3 years ago

@juj It seems like, if browsers did implement the FF maximum-reservation scheme, then with a 2gb maximum specified, there should be no difference in your experiment between (1) the memory successfully allocated by new WA.Memory({initial:X}) probing and (2) the size to which you can eventually memory.grow. And (2) would have the added benefit of only being reserved memory.

Jukka, would a good deal of your needs be addressed if:
1. all browsers implemented the `maximum`-reservation scheme

2. there was a `discard` instruction (as mentioned in [future features](https://github.com/WebAssembly/design/blob/master/FutureFeatures.md#finer-grained-control-over-memory) and [briefly entertained as an MVP instruction](https://github.com/WebAssembly/design/issues/384))
?

That would certainly be expected to fix the Chrome and Safari issues that allocating a large initial is better than growing from a small initial. That would also be expected to fix the Fast App Switching problem.

I am not sure if after those we will still have stability issues on 32-bit browsers, caused by a Wasm page reserving a large 2GB part of the process address space, leaving the browser with <=2GB left for its own use. Currently nothing stops a browser from gnawing back from top end of that address space if JS side does large XHRs or memory intensive WebGL operations, but if the page happened to temporarily have done a huge 2GB alloc (to grow() to consume the whole heap) but then freed all of it, that address space would then permanently be off limits for the browser to chip into. Would a .shrink() operation to enable address space stealing be too contrived to implement?

One particular detail about .discard() is the behavior that should happen when an app attempts to touch the memory to commit it again, but there is not enough memory available to commit. Regular JS ArrayBuffer allocations and Wasm .grow()s are "blocky" in that if I want to allocate e.g. 1GB, the allocation is monolithic in that 1GB, and if that fails, I should be able to gracefully get a JS exception/trap out of it, and decide to do something else. This is super-important for stability.

But touching memory to commit it will not be blocky, but will roll in one page at a time, so one might get 900MB of that 1GB reserve committed, and then run into a page that finally exhausts the available physical memory. We would not want the browser to silently reload the page like current Firefox/Safari/Chrome behavior on OOM can be. But instead, one would prefer to have a way to gracefully manage the page commit failure, and avoid the 1GB allocation altogether (and probably uncommit that 900MB from before to avoid browser small OOMing itself right after). So the exact semantics of what should happen when a page commit fails on memory store are important. (also what should happen when one attempts to load memory from an uncommitted page?)

Maybe in addition to memory store implicitly committing a page, there could be a dedicated instruction .commit(addr, length) that would commit the address range rolling from addr to length, and synchronously return the number of pages that were committed, so that it would be possible to implement memory allocators that could check that the memory it is handing out will be guaranteed available as committed to the caller (instead of the caller having to find out on its Nth memory store to the allocated region)?

Would the commit vs reserve page size be fixed (to the same 64K of the wasm page size?), or variable size depending on the underlying architecture? If variable size, can there be an instruction to query this size?

Finally, would it make sense to give applications an instruction to programmatically query a) if a given address (range?) is committed or not, and b) ask the number of committed pages total in wasm memory? Those would be nice to help implement debugging and profiling support to applications and allocators.

lukewagner commented 3 years ago

I am not sure if after those we will still have stability issues on 32-bit browsers, caused by a Wasm page reserving a large 2GB part of the process address space, leaving the browser with <=2GB left for its own use.

For these issues, I'd like to re-highlight my earlier comment (second half) about (1) assumed engine clamping of the maximum internal reservation and (2) releasing reserved-by-maximum vmem at low-memory or allocation failure notifications.

One particular detail about .discard() is the behavior that should happen when an app attempts to touch the memory to commit it again, but there is not enough memory available to commit.

That's a great point. More generally, from talking about this with @lars-t-hansen today, it seems like, on systems where a random i32.load might OOM-kill the process, the implementation of initial/memory.grow-allocation should try to eagerly and fallibly "populate" the newly-available memory. I'm not positive, but from reading the man pages, this might be achievable on Android with mmap(MAP_ANONYMOUS|MAP_POPULATE) (using MAP_FIXED for in-place memory.grow), which would hopefully fail gracefully (not crash) if the region can't be populated. (Another candidate is madvise(MADV_WILLNEED), but it's not clear if that's just a hint that won't fail in the cases we want it to fail.)

(As a side note on terminology, and I'm not sure if I'm correct here, so happy to have corrections, but, IIUC: "committed" means neither "virtual address space allocated" nor "RAM pages allocated to page table entries"; rather, it means "you can access this region without SIGSEGV, but it might not be backed by RAM, so you may have a kernel trap on access that may OOM-kill you". Given this, it seems like "committed" doesn't imply the desired property of "not crashing at random i32.loads" (c.f., Linux "overcommit"); instead you want this more subtle, ephemeral and heuristic (since presumably the kernel can do whatever it wants) concept of "populated".)

Returning to the hypothetical discard instruction (which called madvise(MADV_DONTNEED)), it seems like this would un-populate the region, and thus have the possibility of crashing on the first i32.load in the region. Incorporating your idea above, maybe there could be an additional populate instruction which took a range and returned a bool (i32) indicating "I was able to populate this region". Semantically, it would have no side-effects, but when used in conjunction with discard it could be used by a malloc impl to achieve the goal of releasing unused RAM to the system while avoiding crash-on-i32.load.

juj commented 3 years ago

(As a side note on terminology, and I'm not sure if I'm correct here, so happy to have corrections, but, IIUC: "committed" means neither "virtual address space allocated" nor "RAM pages allocated to page table entries"; rather, it means "you can access this region without SIGSEGV, but it might not be backed by RAM, so you may have a kernel trap on access that may OOM-kill you".

I must admit that I am not familiar with the Linux/Unix parlance of these terms, but I hope my use of "reserved" vs "committed" in earlier messages follows the correct semantics that Windows uses them with (https://docs.microsoft.com/en-us/previous-versions/ms810627(v=msdn.10) ).

For these issues, I'd like to re-highlight my earlier comment (second half) about (1) assumed engine clamping of the maximum internal reservation and (2) releasing reserved-by-maximum vmem at low-memory or allocation failure notifications.

In the absence of a .shrink() operation, or a way for browser to steal back uncommitted (unpopulated?) pages, it seems to be that such releasing reserved-by-maximum vmem would only work if the app was still in its initial pristine condition. Later in the app lifecycle, there may not exist any reserve left, as the application has grown to consume all of it - but would have no way of telling the browser if it is no longer actually using it or not.

It might be brittle if the reservation stealing would only work if the wasm app was still pristine, but not if it had earlier temporarily used a lot of memory.

Or maybe .shrink() is not needed and such reservation stealing would also work on unpopulated pages at the top end of the heap, where the browser could take those away in low mem scenarios even if app had .grow()n to them but later discarded the pages; and forbid the wasm app from populating any of the high pages if the browser needed to use them to avoid OOMing?

It is true that such .shrink() only from the top end type may require developers to pay extra attention to fragmentation, but I do see that as being better option, compared to the possible problems that might arise if wasm apps that have temporarily used a lot of memory can make the browser more prone to OOMing.

Returning to the hypothetical discard instruction (which called madvise(MADV_DONTNEED)), it seems like this would un-populate the region, and thus have the possibility of crashing on the first i32.load in the region. Incorporating your idea above, maybe there could be an additional populate instruction which took a range and returned a bool (i32) indicating "I was able to populate this region". Semantically, it would have no side-effects, but when used in conjunction with discard it could be used by a malloc impl to achieve the goal of releasing unused RAM to the system while avoiding crash-on-i32.load.

This sounds very good. That would help apps decide to do something else on large OOMs without risking of populating up to last available page in the browser and then failing.

What would the semantics of memory loads and stores in general be like to unpopulated pages, when there is plenty of memory available? Would each touch of a page implicitly populate under the hood? Or would it trap? It feels like either behavior could be useful, not sure which way to lean on this.

Also, would it make sense to have an instruction to switch a page to be read-write vs read-only vs noaccess? Those could be interesting to help debugging and error catching.

lukewagner commented 3 years ago

In the absence of a .shrink() operation, or a way for browser to steal back uncommitted (unpopulated?) pages, it seems to be that such releasing reserved-by-maximum vmem would only work if the app was still in its initial pristine condition.

This is where it's important to distinguish "reserved-by-maximum vmem" from "memory accessible to wasm via initial or memory.grow". The former is only reserved, not committed, and thus wasm will trap if attempting to access it. Thus, memory reserved by maximum is necessarily pristine/unpopulated, so the browser can unobservably (other than failing a future memory.grow) release it.

What would the semantics of memory loads and stores in general be like to unpopulated pages, when there is plenty of memory available? Would each touch of a page implicitly populate under the hood? Or would it trap? It feels like either behavior could be useful, not sure which way to lean on this.

Although you could imagine a trapping semantics being useful for catching bugs, this would place a major requirement on wasm engines to use signal-handler tricks to avoid costly per-memory-access checks (which not all engines can do now or in the future). That's why I proposed above that populate have no semantic side effect on linear memory or future loads/stores. You could imagine a debug-mode that caught such errors.

Also, would it make sense to have an instruction to switch a page to be read-write vs read-only vs noaccess?

Definitely agreed that these would be valuable, but the same caveat applies that implementing this feature without the benefit of memory-protection+signal-handlers would be pretty expensive. At least, that's what has held us back so far; maybe we should revisit this at some point. There's also challenging questions in the JS API for how to handle typed array views that overlap read-only or inaccessible regions.

juj commented 3 years ago

Thus, memory reserved by maximum is necessarily pristine/unpopulated, so the browser can unobservably (other than failing a future memory.grow) release it.

I do understand that, but that is not the scenario when I am concerned about browser not being able to release it.

If the wasm page temporarily uses all of the reserved max memory, i.e. .grow()s to take over it, but then later releases most of it, then the memory region would again be unpopulated, but currently the browser cannot recognize it, and cannot claim any of it to its own use. This temp large .grow() is what I am concerned about, since that cannot be undone unless a .shrink() would be supported.

That's why I proposed above that populate have no semantic side effect on linear memory or future loads/stores. You could imagine a debug-mode that caught such errors.

That does make sense. discarding a page will then be the same as memsetting it to zero?

Definitely agreed that these would be valuable, but the same caveat applies that implementing this feature without the benefit of memory-protection+signal-handlers would be pretty expensive. At least, that's what has held us back so far; maybe we should revisit this at some point. There's also challenging questions in the JS API for how to handle typed array views that overlap read-only or inaccessible regions.

Gotcha - memory protection is something that I don't see critical at all for solving mobile memory problems, so that can certainly be left out. Was just rather curious whether that would have come practically "for free" on the side.

lukewagner commented 3 years ago

If the wasm page temporarily uses all of the reserved max memory, i.e. .grow()s to take over it, but then later releases most of it, then the memory region would again be unpopulated, but currently the browser cannot recognize it, and cannot claim any of it to its own use.

Ah, that's a different case than I was replying to. For the memory.shrink case, as I was saying in my earlier comment, unless a custom memory allocation scheme was employed, I would imagine that usual internal fragmentation problems would prevent memory.shrink from helping much with malloc()+free() (e.g., if there is even 1 tiny malloc() performed after the large temporary allocations, it would prevent the memory.shrink). Are you imagining the use of such a custom allocation scheme? When I imagine how such a custom allocation scheme would need to be implemented, it seems tricky: to prevent malloc() from going after the large temporary allocations, you'd need to reserve a fixed amount of space before the large temporary allocations, and sizing this region would require some of the same hard questions (how much, with penalties for too-much and too-little) you outlined at the root of this thread.

For the general case, I think discard would get you most of what you need: releasing the physical RAM without paging -- it's only the vmem range that's not being released back to the browser. When I imagine a typical loading sequence, it seems like load-time (when the temporary large allocation is made) is the point at which vmem is most scarce, and once the app "survives" this pinch-point, it's mostly good. If the app performs a sequence of loads (e.g., loading levels), there's even a risk that, after giving vmem back after the first load, subsequent fragmentation prevents the second load from growing again. Thus, it's a question of whether the engine should even give back vmem to the browser after a memory.shrink.

The reason I push back on memory.shrink is that it opens a big can of worms for shared memory and a small can of worms for non-shared memory, so it's something I was hoping we could avoid, if indeed it has limited practical applicability.

That does make sense. discarding a page will then be the same as memsetting it to zero?

Yep!

titzer commented 3 years ago

Based on the memory management in V8 for array buffers, especially shared buffers, it would be quite difficult to support memory.shrink, so I mostly concur with @lukewagner here.

juj commented 3 years ago

I would imagine that usual internal fragmentation problems would prevent memory.shrink from helping much with malloc()+free() (e.g., if there is even 1 tiny malloc() performed after the large temporary allocations, it would prevent the memory.shrink). Are you imagining the use of such a custom allocation scheme?

Yes, indeed. I'll repeat the rationale:

Indeed it would be an opportunistic behavior where an emmalloc/dlmalloc impl could only .shrink() when the freed allocations occurred at the top of the heap, which may not be the case for many applications (and needs the programmer to be memory fragmentation aware). Although in some apps, this could "trivially" be the case when they do large transitions in application lifetime (user closes edited document, player exits a game level back to main menu), where user navigation flow has been able to guarantee this kind of stacking allocation order.

In some applications it is easily the case that on "grand scale" the allocations have good stack-like characteristics when one transitions between e.g. main menu and the game levels. Applications may need to develop custom memory pools to manage this kind of behavior, but that is not much different from wasm today.

Already in the absence of .shrink(), Wasm developers need to be mindful about memory fragmentation, so introducing a .shrink() would not change that fact.

When I imagine a typical loading sequence, it seems like load-time (when the temporary large allocation is made) is the point at which vmem is most scarce, and once the app "survives" this pinch-point, it's mostly good.

This is perhaps a bit too simplistic model. If we look at app loading flow, then under current .grow() only model, it will actually be the "second document load" (document/game level/asset/...) that will cause the most simultaneously consumed address space pressure, e.g.:

Load a large, say, 300MB, document to JS memory (e.g. from XHR or IndexedDB - former could stream, latter cannot)
wasm.grow() memory +300MB to fit the document, memcpy the document to Wasm memory
unload doc from JS memory, -300MB,
wasm.grow() a second time, for, say, +1GB to unpack/expand the doc in wasm, and fit a working memory area for processing the document
document unloads, -1.3GB of unused memory in wasm (would .shrink() here if available)
load a (the same?) document again, e.g. +300MB to JS memory, but OOM since cannot find address space for 1.3GB + 300MB simultaneously.

Outside the loading process, applications can also have large persistent JS side memory allocations long after the wasm heap has been .grow()n to its maximum size. E.g. when

working with large amounts of data with IndexedDB,
using Web Audio,
using WebGL or WebGPU,
large web requests, or
other marshalling of data between JS and Wasm.

but the size of the needed JS side memory can vary between documents/game levels, so when one level might need more of Wasm memory, another level might need more of JS memory. E.g. in Unity game specifically, if there is programmatically heavy computation (pathfinding, AI, noise, skinning, some other game C# computation) in one level, that could amplify a lot of wasm .grow()s to occur. If there is a lot of audio, or data marshalling, or cutscene videos, then there will be a lot of JS memory usage. These Wasm vs JS side maximums will not necessarily happen at the same time, but without a .shrink() operation, one cannot do anything to combat this (and should probably pretend as if these maximums did occur simultaneously).

With wasm32 at least 64-bit browsers will be immune to this, so this will be a 32-bit browser only concern. Not sure what will happen with wasm64.

The reason I push back on memory.shrink is that it opens a big can of worms for shared memory and a small can of worms for non-shared memory, so it's something I was hoping we could avoid, if indeed it has limited practical applicability.

It would certainly not be a 100% cure, since a wasm application that was not fragmentation aware would not be able to benefit. Though if one is developing a wasm page with large data sets, unfortunately one will already need to be fragmentation aware, there is no escaping that with or without .shrink().

I do appreciate the trouble with shared memories.

lukewagner commented 3 years ago

Thanks for the info @juj. The point I'm trying to dig into, though, is: even though apps may have this stack-like "grand scale" behavior you mention, that doesn't ensure that memory.shrink can be used w/o a very special allocation scheme along with app-wide coordination. (My expectation is that: without explicit global coordination, it would be very easy for tiny mallocs to creep in that break your ability to memory.shrink.) I can theoretically imagine such global coordination schemes, but it seems like potentially a big architectural change, which is why I wonder whether it would be implemented in practice.

In some applications it is easily the case that on "grand scale" the allocations have good stack-like characteristics when one transitions between e.g. main menu and the game levels.

Your loading scenario makes sense, but a slight variation shows how having the browser release vmem dynamically could be equally problematic: imagine step 5 shrinks (and releases vmem) and but then step 6 tries to perform a large new wasm allocation which now fails due to fragmentation. This seems like a difficult tension to resolve in general.

What I can imagine being a more reliable way to avoid this kind of thrashing between wasm and JS memory is to avoid pulling in large allocations directly into linear memory all-at-once by instead streaming bounded-sized chunks (from a backing Blob or ArrayBuffer) into wasm memory on-demand. (I know that's not always possible, though.)

Load a large, say, 300MB, document to JS memory (e.g. from XHR or IndexedDB - former could stream, latter cannot)

On a side note, IIUC, on both Chrome and Firefox, Blobs are not kept in memory. Thus, I think you can "stream" a Blob by .slice()ing it into fixed-size chunks that are individually .arrayBuffer()ed.

penzn commented 3 years ago

@conrad-watt sorry for taking this long to reply :)

The "simplest" version of (my interpretation of) this idea would be to make such buffers like any other GC object. That is, each buffer would have an entirely disjoint address space from any other (enforced by bounds checking), they'd have their own family of load/store operations, and would be stored (by reference) in a table, or as a field of another GC object, rather than in linear memory.

That should work as long as we can present the allocated objects as something memory-like to the consumers in the module. Do we have enough support for this in the standard or near-future proposals?

or if the idea was to tie more closely to the existing linear memory (by having a host procedure to manage chunks of linear memory that are still manually accessible through regular load/store?).

This is what I originally thought, since that is closer to how accessing memory works in the native world, though after giving it a little more thought I am not sure this would be easy to support within existing model of linear memory.

penzn commented 3 years ago

@conrad-watt's approach can be extended to support POSIX stack emulation - instead of incrementing a global "stack base" symbol on entry and decrementing it on exit, function can request an object on which would be GC'd after it exits. This would free up linear memory and prevent stack walking.

lars-t-hansen commented 3 years ago

@conrad-watt's approach can be extended to support POSIX stack emulation - instead of incrementing a global "stack base" symbol on entry and decrementing it on exit, function can request an object on which would be GC'd after it exits.

Not in any language that can take the address of stack variables, I think.

penzn commented 3 years ago

Not necessarily, if some form or reference would be considered an address it would work; also this issue would apply to heap objects too. I am not yet sure how this would work though, my speculation would be that via some combination of interface types and GC we can get an object which can be represented as a bag of bytes and then do something that resembles memory operations on it.

laughinghan commented 3 years ago

@lukewagner

The point I'm trying to dig into, though, is: even though apps may have this stack-like "grand scale" behavior you mention, that doesn't ensure that memory.shrink can be used w/o a very special allocation scheme along with app-wide coordination. (My expectation is that: without explicit global coordination, it would be very easy for tiny mallocs to creep in that break your ability to memory.shrink.) I can theoretically imagine such global coordination schemes, but it seems like potentially a big architectural change, which is why I wonder whether it would be implemented in practice.

The kind of fragmentation you're describing makes it sound like you're thinking of general-purpose allocators like dlmalloc/jemalloc, but my understanding is that it's common for games to use (for example) arena allocators (aka bump allocators). The way they work is you can't fine-grained free() individual blocks of memory at all, instead you allocate in constant time by incrementing (bumping) a pointer, and then when you're done rendering that frame, you free the entire arena in constant time by resetting the pointer. E.g.:

bumpalo appears to be a popular Rust bump allocator
Google's protobuf docs suggest arena allocation for services written in C++, with the idea being that you free the arena when you finish serving the request
one of Zig's built-in allocators is std.heap.ArenaAllocator

I think the idea with most of these is that if you need some info across frames/requests, you just store it globally; but separate arenas with different lifetimes are also a thing, typically called region-based memory management. Obviously these are indeed a big architectural decisions with app-wide implications, but I don't think they're unusual at all in practice, especially for games, which are a major use case for Wasm IIUC.

I apologize if you already know all this—most of this ticket is over my head, nor am I a game developer.

lukewagner commented 3 years ago

@laughinghan Yes, that all makes sense. SpiderMonkey, which I worked on, also uses custom allocators extensively. But ultimately these custom allocations need to exist within the global address space (in wasm's case, linear memory), and the memory.shrink optimization only works if you not only use a custom allocator but also carefully place it at end of linear memory (ensuring, e.g., that there's enough space "in the middle" so that malloc and friends never have to allocate at the end -- which raises some of the "what's the maximum (malloc) heap size" problems above). As I said, possible, just non-trivial for a realistic (esp. pre-existing) large system, hence the question.

cmuratori commented 2 years ago

I wanted to clarify some things here, since I use direct OS memory calls often, and have also been a big proponent of the "bump allocator" strategy being discussed. I think there is some confusion about the landscape of low-level memory systems, what they do and do not need, and what they implicate for WASM memory management.

Because these are long, I'll split into two posts. This post is just "what services do OSes actually provide", so it can be easily referred to. If anyone notices any important features I've left out, please reply and I'll edit this to add it.

Operations provided by VirtualAlloc/VirtualFree/MapViewOfFile/VirtualLock/mmap/munmap/mlock/etc.

When designing an intermediate memory protocol like WASM, it's instructive to consider all the features actually available at the OS level that are useful. WASM might not want to expose them all, since heterogeneous OSes would make it so there would be too many "optional" permutations in the spec. But it seems useful to list what all the things are before proceeding, so it's easy to refer to and know which ones are and aren't going to be supported, and why.

Reserve without base address

The simplest operation is to reserve an address range in the virtual address space of the process without caring where it is. Technically this only requires the operating system to ensure that space exists in the process's address space mapping tables. At least on some OSes, assuming a 64-bit address space, it will essentially always succeed if the size of the range is even remotely reasonable, because the OS does not need to actually use any resources to complete the request, other than a new entry in the table for the process. In 32-bit address spaces, however, it's entirely possible for this to fail even when the size is reasonable, due to fragmentation of the address space, when no contiguous virtual address range can be found large enough to hold the requested size.

This operation is used by applications to announce their intention to possibly use the memory in that range at some point, but the expectation is that the application may well never use all of it.

Commit with on-demand physical mapping

When an application actually wants to start using some of the address space it has reserved, it asks the operating system to prepare to store the contents of that many pages of actual memory. This is a commitment from the operating system that the memory can now be preserved somewhere.

OSes with paging (Windows, Linux, etc.) do not necessarily map physical memory when virtual addresses are first committed in this way. They may merely ensure they have the necessary space in physical memory or the page file, but defer mapping physical memory until the application page faults on an access to a particular page, at which point it will map that virtual page to a physical page and resume.

I believe (but my knowledge might well be out of date) that Windows currently fails to do a commit operation if the page file does not currently have room to fit the entire commit requested, whereas Linux will not fail, and instead fails only once the application actually page faults on enough pages to overflow its remaining physical and/or page file space.

Thus, while an application reserving 1tb of virtual address space on Windows and Linux might succeed, the attempt to actually commit that memory might fail on Windows unless the page file was that large, but might still succeed on Linux. The Linux app will then fault when the app actually touched enough of the memory to overflow the page file. Again, take that with a grain of salt - that is just my recollection. So this is an important difference in behavior that occurs strictly because the two OSes make different choices about what steps they take when the user asks for things; both have "just reserve" steps that (almost) always succeed, but only Linux has a "commit but not really" phase :)

Bulk physical mapping

The behavior of on-demand physical mapping (waiting for the page fault, then mapping a physical page just-in-time) can be very costly, because it causes constant transitions to the OS as the application first uses each (usually 4K, sometimes 2mb) page. For this reason, there are calls available (some very esoteric) that will map committed virtual pages to actual physical pages in bulk, rather than on the page fault. Applications sensitive to this behavior with knowledge of their memory access pattern can call these functions to map larger groups of physical pages with a single OS transition, rather than once for every page.

Reserve with base address

In addition to being able to ask the OS to provide you with a virtual address range of a particular size somewhere in memory, you can also ask for it to be in a specific place. This allows you to control exactly what virtual address range you get. This actually allows unique optimizations for people who do this sort of thing: pointers are now persistent across runs of a program, for example, and do not have to be "remapped" when data is saved, restored, or sent over a network.

Surprisingly, some OSes not only allow you to specify what virtual address you want to reserve, but which page file/physical addresses you want them to correspond to when you commit. While you cannot specify arbitrary addresses (which would make no sense, because the page file is shared), you can specify existing memory that you wish to map to.

This allows for some surprising tricks, such as automatic circular buffers.

As an example, suppose an application wants a 2mb circular buffer. The application reserves three 2mb memory blocks, each spaced with a virtual address 2mb apart. It then asks the OS to commit one of the blocks (it doesn't matter which), and then asks for the other two blocks to commit to the existing committed block's storage.

This makes all three 2mb virtual address ranges map to the same part of physical memory and/or the page file. So now, there is no code necessary whatsoever to implement the circular buffer. The application merely treats the middle 2mb block as the buffer, and reads/writes to it. When it "overflows" the end of the buffer, it just ends up reading/writing the beginning of the buffer, because the next 2mb memory region after the buffer maps to the same memory. The same thing happens on "underflow", but in reverse.

Commit with user file backing

Much like you can specify an existing committed page to commit more virtual ranges to, some OSes provide the ability to map virtual addresses to existing user file ranges. This is used when the thing being represented in memory is actually just the contents of a file on disk that the user controls. While this may seem redundant, it's actually very useful for eliminating unnecessary page file usage. If an app uses this feature, they pay no page file cost for the memory they are committing, because it is stored in their file, not the page file. If an app doesn't use this feature, when it loads files into memory, they may now be stored on disk twice: once in the source file, and once in the page file backing the physical memory. While this wouldn't matter much for small files that trivially fit in available physical memory, it is obviously very relevant for large ones where paging is expected to occur.

It also has the added benefit of eliminating OS calls that would otherwise have had to occur. The application never needs to issue "read" or "write" operations to bring the file into memory or write it back to disk, since the OS already knows that needs to happen and can take action independently.

Lock

OSes also allow locking virtual pages to physical pages, so that they will not be "evicted" to the page file or relocated to another part of RAM. Obviously, the maximum amount of memory for which this can happen is limited to some portion of the total physical RAM in the system.

Although there's nothing stopping an application from using this feature as a performance optimization ("I know this chunk of memory is important and I never want it swapped out"), it is mainly used for kernel or hardware communication. Because virtual addresses are per-process, and pages can be transient in physical memory, anything that needs to be quickly accessed by multiple processes, the kernel, or hardware collaboratively may need to be locked in physical memory so that it does not need to go through costly kernel operations to ensure consistency.

Decommit (undoes a commit)

When an application no longer needs the operating system to remember the contents of previously committed pages, it can "decommit" those pages. This keeps the address range reserved, so no other operation in that process will reserve an overlapping range. But it allows the operating system to release all resources associated with preserving the contents of the pages, such a physical memory or sections of the page file.

Release (undoes a reserve)

Although only necessary in highly heterogeneous code, if a process is made up of independent "modules", it may be important to be able to release previously reserved ranges of memory. This does not really free up any actual memory resources (decommit frees up those resources), but rather frees up virtual address space so some other part of the process can use it.

Zeroed page guarantee

For security reasons, most modern operating systems never provide a physical RAM page to the process unless the entire page has been cleared to zero first. As a result, well-written programs that use low-level memory systems can avoid clearing their own memory when they know it came directly from the OS. This can save a substantial amount of time which would otherwise be spent by the application clearing ranges of memory that the OS has already spend time clearing itself.

- Casey

aardappel commented 2 years ago

@cmuratori

I believe (but my knowledge might well be out of date) that Windows currently fails to do a commit operation if the page file does not currently have room to fit the entire commit requested, whereas Linux will not fail, and instead fails only once the application actually page faults on enough pages to overflow its remaining physical and/or page file space.

Correct, you'd need to use MEM_RESERVE to get the same ability as Linux to use large amounts of address space, but that typically requires the use of guard pages to actually commit, which entails incremental growth rather than random access.

That is still useful for the majority of use cases though, so any Wasm memory mapping feature would likely want to use this lowest common denominator of incremental access.

Using MEM_COMMIT on Windows is kinda useless since a pagefile is often not that much bigger than physical memory, meaning that it is only barely more useful than just calling malloc for applications that benefit from allocating large amounts of address space all at once (for e.g. never needing to move pointers).

cmuratori commented 2 years ago

With vague apologies for the previous lengthy post, here is another one, with my actual comments on some of the issues with WASM's memory model. In general I would just like to support @juj's comments, which all sound exactly right to me, and also to thank them for the exploration of phone behavior, which is the kind of information that is hard to come by!

Here are the points I thought were important to consider:

There are usually multiple bump allocators in any application, even a game

Even in the simplest design for a very straightforward game post-90s, there are typically at least two bump allocators, not one. There is at least a "frame stack", where memory that is only valid during a single frame is pushed; and a "persistent stack", where memory that is valid forever is pushed.

In more complex designs, there are often several bump allocators. There will at least be one per thread, for the thread's usage. There will probably be one for the frame. There will probably be multiple for persistent things, such as "all time" persistent, as well as "for this level" persistent, etc.

So, in general, that a particular programming style may eschew malloc/free in favor of exclusive use of bump allocators does not imply that the application would be straightforward to implement on top of a single bump allocator.

A single OS-supported memory range is significantly worse than multiple, even for single-threaded applications

When multiple bump allocators are used that must share a single virtual address range (as they would in the current WASM), there are two ways this can be implemented. One way is to sub-allocate the entire range for each allocator up front, and the other is to sub-allocate in pages dynamically.

The first version requires the programmer to know a priori exactly how big each bump allocator needs to be at maximum, and then the single OS-supported memory range is partitioned into those pieces. The bump allocator now functions as normal with no overhead. This is difficult (or sometimes impossible) to do in practice, because especially when user-generated or editable content is at play, there may be no way to guess how big each allocator should be.

The second version requires the bump allocator to operate with its own idea of "pages". As each bump allocator needs to grow, it first checks to see if has space, and if it does, it proceeds with minimal overhead. If it does not, it takes the next "page" from the OS-supported memory range for its own use. When the allocator retreats, it "releases" these pages for use by other bump allocators. This design also has the drawback that the maximum contiguous size that can be provided by the allocator is one underyling "page", whatever the app chooses that to be.

As is hopefully obvious, these are usually inferior solutions to the much simpler and faster OS-supported version on 64-bit OSes. On these platforms, an arbitrary number of bump allocators can provisioned by asking the OS to reserve (but not commit) a large virtual address range for each - say, 16gb. Because this costs the OS practically nothing, you can have many 16gb virtual address ranges reserved, one for each allocator.

The bump/retreat process is now just one check to see if any new pages need to be committed/released. Otherwise, everything else "just works", the memory ranges are always nicely contiguous.

The reason this works is because the OS and CPU effectively already implemented all the necessary stuff, and all your memory accesses go through that machinery anyway. So if you don't expose the ability to do RESERVE/COMMIT/DECOMMIT/RELEASE on multiple ranges, you effectively consign the application to reimplementing the entire OS/VMU subsystem at great cost :(

WASM's current design isn't that far from a more flexible alternative

There are two obstacles to efficient use of bump allocators on WASM's current memory design. One is the use of a single virtual address range, and the other is the inability to decommit pages.

The single virtual address part is straightforward: because WASM allows the programmer 2gb or 4gb of contiguous virtual address space, there is no way to create multiple bump allocators each with the ability to grow to use a full 2gb or 4gb of memory. If the WASM memory model were changed, such that there was an actual reservation op, this would greatly improve the flexibility.

The decommit part has already been discussed in detail by @juj. Although pages in WASM are apparently committed automatically (I have not looked at the implementation), there appears to be no way for the user to decommit a page. Therefore, were they to create multiple bump allocators with multiple ranges using a new "reserve" WASM op, they would still face a problem that the total memory commit would be significantly larger than necessary, since as some bump allocators retreated, they would not be able to release their resources back to the system for use in provisioning pages to other bump allocators.

Ring mapping and guaranteed zero pages can be important

The ability to create automatic ring buffers, and the ability to know that initial page provisions from the OS are zeroed, can both lead to substantial optimizations. It's worth considering whether these things warrant inclusion in WASM in some way, depending upon the range of platform support that WASM currently has (I am unfamiliar with what the weakest OS support for memory mapping is across the range of WASM-supporting devices, so I do not know whether, for example, circular buffer mapping would be supported on all WASM platforms or only a subset).

- Casey

cmuratori commented 2 years ago

Using MEM_COMMIT on Windows is kinda useless since a pagefile is often not that much bigger than physical memory,

That is not the use case for MEM_COMMIT that would be relevant here, though. It is actually very important for dynamically sized bump allocators. For a typical bump allocator, you will MEM_RESERVE 16gb or something like that, then only MEM_COMMIT incrementally as you need it. This allows for all your bump allocators to potentially grow to encompass large amounts of memory without needing to know ahead of time which ones will need to do so, or when. You then MEM_DECOMMIT as they shrink, to allow them to dynamically balance.

- Casey

aardappel commented 2 years ago

@cmuratori yup that's what I meant. I was comparing against the case where you'd MEM_COMMIT all of it, as you can on Linux, if you have a more random access use case.

Agree that "multiple (non-moving) bump allocators" is a very important use case.

My description above is incorrect though, you can commit incrementally also randomly on Windows, with AddVectoredExceptionHandler that will fire on any uncommitted access, not just ones marked with PAGE_GUARD.

cmuratori commented 2 years ago

My description above is incorrect though, you can commit incrementally also randomly on Windows, with AddVectoredExceptionHandler that will fire on any uncommitted access, not just ones marked with PAGE_GUARD.

That is what I thought was the case as well, but since I've never actually implemented that, I wasn't 100% sure. Good to know...

- Casey

verbessern commented 2 years ago

I think the answer to the question "Is memory shrink needed?" is related to this: what will happen if all executables in the OS are unable to free memory, but only allocate? That is quite a disturbing question, is it not?

Currently the standard assumes that the memory only grows, and the threads proposal takes "advantage" of that for the shared memories, by only checking the length of the memory atomically, not locking the whole memory length. If the memory is allowed to shrink, then that will force all the threads to synchronize per each memory access.

With other words, to be able to "load", only one thread must be active in this time, to ensure that the memory does not shrink in the same time and invalidate the load address. Same for "store". I personally think that the inability to shrink memory is a design flaw, that should not pass the MVP step.

I understand the performance hit on locking, but the shrinking of memory is still more important then that. Maybe some readers (load/store) vs writers (grow/shrink) problem had to be explored, to mitigate the effect.

cmuratori commented 2 years ago

Digging around a little bit, I notice this:

https://github.com/WebAssembly/multi-memory/blob/master/proposals/multi-memory/Overview.md

Assuming I read it correctly, it seems like it would address the main issue with multiple bump allocators, but not the decommit part that @juj originally raised. I will file an issue on that spec, referencing this discussion.

- Casey

aardappel commented 2 years ago

@cmuratori multiple memories would not allow for multiple bump allocators, as they produce unrelated address spaces, i.e. objects allocated on 2 of such spaces could not be passed to a single function operating on them (load instructions have the memory index statically baked in).

Also, it is not known yet what performance hit a Wasm engine will take supporting multiple memories (which could be an additional indirection for each memory access in the worst case).

cmuratori commented 2 years ago

Also, it is not known yet what performance hit a Wasm engine will take supporting multiple memories (which could be an additional indirection for each memory access in the worst case).

I assume this is because some important target hardware does not support large virtual address ranges?

- Casey

aardappel commented 2 years ago

@cmuratori browsers prefer not to hand out large amounts of virtual address space to each browser tab that may run one or more Wasm modules / memories (I forget why, I seem to remember there are security related reasons), so reserving all memories consecutively is maybe out of the question. Besides that, memories can be shared, allocated from JS etc. Hence, an indirection may be needed.

cmuratori commented 2 years ago

@aardappel But presumably this is an implementation detail specific to browsers? For example, if you were running WASM not in a browser - say, in a native Electron instance - could this restriction not be relaxed?

I guess I would just point out that there are other uses for WASM that will not involve browser security models, and it would be a shame not to have the ability to encode modern, efficient memory semantics for those situations.

- Casey

ShadowJonathan commented 2 years ago

Some 2c; coming from the world of Kubernetes, PaaS, and containerisation, WASM looks to be incredibly promising from an optimisation, packaging, and sandboxing standpoint.

Thus, these kinds of use-cases require almost all the kinds of features and tricks a "server program" might give. What i'm reading here mostly confirms that, where I can see the possibility of something akin to a WASI-imported module giving more flexibility around memory usage, pages, and more.

The issue of not being able to "deallocate" and give back memory to the system bugs me the most, because if a program will want to try to work around that, it's essentially working with allocated memory again, which breaks the guarantee of zero-pages, for which the application would then have to work around, again.

I don't know a good solution to this though, I just want to give a perspective from the world of server applications, for which WASM could be a suitable alternative, but which requests an on-par memory flexibility functionality, almost similar to current-day "normal" applications.

Some interesting additionals

[Krustlet, WASM Kubelet](https://krustlet.dev/) [Hippo, WASM PaaS](https://docs.hippofactory.dev/) [Docker Dev commenting how WASM+WASI would've made docker redundant if it existed in 2008](https://twitter.com/solomonstre/status/1111004913222324225)

aardappel commented 2 years ago

@cmuratori oh sure, and we definitely want Wasm to work well absolutely everywhere, if possible. But something not working well in the browser tends to be a deal breaker for features going into core Wasm :)

aardappel commented 2 years ago

@cmuratori as a bit of fun history, you'll appreciate that my first ever interaction with the Wasm team (> 6 years ago!) was this issue, which was me worried that Wasm would be designed to be too Web specific, and arguing for game use cases etc: https://github.com/WebAssembly/design/issues/249

Have been arguing that ever since :)

flaki commented 2 years ago

I am horribly late to the party, but I have observed three main themes in this (huh, long!) thread:

Linear memory memory, but it actually works (and ideally with shrink semantics / some ability to release allocated space)
Long-lived flexible allocations of very specific intended usecases (the flexible and individually growable/shrinkable arenas above)
Short-lived "bust" memory ranges that can be discarded after some initial computation, like the one described by @juj originally:

This is perhaps a bit too simplistic model. If we look at app loading flow, then under current .grow() only model, it will actually be the "second document load" (document/game level/asset/...) that will cause the most simultaneously consumed address space pressure, e.g.:

Load a large, say, 300MB, document to JS memory (e.g. from XHR or IndexedDB - former could stream, latter cannot)

wasm.grow() memory +300MB to fit the document, memcpy the document to Wasm memory

unload doc from JS memory, -300MB,

wasm.grow() a second time, for, say, +1GB to unpack/expand the doc in wasm, and fit a working memory area for processing the document

document unloads, -1.3GB of unused memory in wasm (would .shrink() here if available)

load a (the same?) document again, e.g. +300MB to JS memory, but OOM since cannot find address space for 1.3GB + 300MB simultaneously.

There doesn't seem to be much spec-wise to do about the first issue, rather than fixing the reservation strategies. Bolting-on shrinkable memories seem to be a massively non-trivial and any sort of complexity incurred by deallocation/defragmentation would probably incur portability and/or security issues. However.

It feels like that fixing the second/third usecase with something that is more akin to separately-manageable "secondary" memories would be able to sufficiently alleviate the pressure on the main memory (which could be kept reasonably small). Unfortunately, as mentioned by Andreas in the Multi-Memory proposal issues and above, the multi-memory proposal doesn't really fit the bill, but perhaps fixing all the above issues might warrant a separate vehicle. Such a "secondary" or even "buffer" memory:

would probably need to be a first-class memory, aka memoryref (as per Andreas in the linked issue)
can be created & destroyed freely and the wasm module should be able to request it
there may be some caveats of it going away and expectations that the WASM module should be able to recover

The last one could potentially be a flag, and would support the buffer/ephemeral reservation usecase and also the quick task switching usecase -- for the later I'd imagine certain rendering/frame-specific processing could be kept in such secondary memories, which could be discarded upon switching away from the app. When the browser would be re-activated, the secondary memory would be re-generated and the WASM app would resume. Similarly if the user hopped out to reply to a chat message during loadtime, the WASM buffer could be thrown away and simply the WASM would restart the processing from the start without the main app state having been thrown away. Potentially growing these secondary memories could be supported for more flexibility in other workloads. I would imagine such buffer memories would have their own instructions and infrastructure.

juj commented 2 years ago

Since the initial creation of this thread, and with Unity 2021.2 now shipping with targeting improved mobile browsers support, we have a number of customers reporting their experience and pain points being quite similar to the initial investigation above.

Either Unity pages don't run at all/they oom later unless one does manual tweaks to allocate "just the goldilocks amount of initial memory" (as one dev put it), or

Unity pages might run ok while the tab is active, but switching out from the tab/browser and later coming back in, the browser has decided to kill the page in the meanwhile. This is reported particularly tricky for UX in web chat applications that embed social games in them, since it causes the chat application to lose the opened "conversation & game context" as well (and apparently push notifications also?), requiring users to re-navigate back to their conversations and opened games, and has required developers to start implementing "defensive page-got-reloaded strategies".

The way that apps would mitigate this would be to minimize their memory usage on "app backgrounded" event (browser window 'visibilitychange'/'pagehide' event), but like discussed on this thread, there is unfortunately no current way to act here on wasm apps.

So far above conversations have been agreeable towards adding a decommit type of operation to wasm memory pages. We believe adding that would certainly solve the issue for 64-bit browsers, and would certainly be a good concrete step forward here.

Maybe as part of adding decommit, improving the expected commit vs reserve behavior should help the issue on 32-bit browsers as well.

@aardappel has opened the conversation for adding support for reserving address space to wasm memory64. A decommit operation would certainly help open more programming techniques when used with wasm memory64 as well.

Multi-memories were discussed above, though my understanding of that feature concurs with aardappel's comment

@cmuratori multiple memories would not allow for multiple bump allocators, as they produce unrelated address spaces, i.e. objects allocated on 2 of such spaces could not be passed to a single function operating on them (load instructions have the memory index statically baked in).

I.e. current Clang/LLVM C/C++ -> Wasm compilation model does not well extend to transparently compiling programs to use a multi-memory model. Supporting such a feature would require manually architecting custom pools, e.g. main app in memory nr.0, whole filesystem in memory nr.1 (or individual large files in their own memories?), audio data in memory nr.2, etc. leading to a considerable rearchitecting of application code, and likely(?) leading to needing to extend Clang with custom pointer type attributes (near and far pointers style of programming back in fashion?)

Because of that, I do not see multi-memory as a possible solution to this issue.

The only back-and-forth part in this thread has been the question of whether shrinking a Wasm memory should be allowed. In a singlethreaded Wasm application implementing shrinking would likely be quite trivial, but it is the shared Wasm memories and multithreaded programs that pose a challenge.

@syg's resizable ArrayBuffers draft at https://github.com/tc39/proposal-resizablearraybuffer proposes a shrinkable ArrayBuffer, but not a shrinkable SharedArrayBuffer.

One possibility here would be to follow suit and allow nonshared Wasm Memories to be shrunk. That would allow the benefit of enabling many singlethreaded applications to leverage the opportunity to shrink where possible, symmetric to the above proposal. It could offer a way for the wider industry to validate the benefits of shrinking, and a concrete way to then either pursue multithreaded shrink or have the industry prove if it indeed is mostly useless, only at a "beauty benefit" cost of retaining behavior symmetric to ArrayBuffers (which iiuc Wasm Memories still have behaved so far - you can pull the ArrayBuffer API out of a Memory).

I still do not 100% agree with @lukewagner's argument that shrinking would be too opportunistic to work. In a "random" malloc memory access patterned application, tha would indeed be the case, but in an application that already implements a host of custom memory allocators, like games employing arena/bump/slab allocators or, e.g. Unity DOTS content model employing highly pooled ECS archetype memory slabs, there is a considerable opportunity and potential to shrink(), especially at points where a "unload everything"/"close everything" action takes place by the user.

In an application that is already having to combat fragmentation by implementing pooled allocation techniques would not be worse off with the ability to also shrink(). So it does not seem that the implementation complexity would increase any amount compared to what the current challenges of defragmenting allocation techniques already pose - just the applications that do tend to fragmentation will be able to do so more effectively than before.

If all the devices that wasm targets were 64-bit today, we might not care about a shrink() operation - though currently only a small single digit % of Android Chrome browsers in the wild are running 64-bit. I think shrink() would provide more opportunity for shipping to 32-bit browsers.

The shrinkable ArrayBuffer proposal does carry appeal for Unity, and even if it does not allow a magic bullet of defragmentation, it would still allow reaping opportunity for reducing address space pressure on 32-bit browsers when a codebase is already doing something to look after its allocation patterns.

(As a sidenote, thinking about shrinking multithreaded memories, it feels like it could be safely done if observing shrinking was modeled to be inherently asynchronous to other threads - i.e. a VM could splice off the shrunk memory area only after it has had the time to safely coordinate the propagation of the new reduced address range to all threads. That could operate in spirit similar to how garbage collection works. However I do not want to pursue this topic too hard, since it has the danger of trailing focus off from progressing the conversation)

To summarize, our perspective is that to help Wasm forward on mobile:

add implicit first-write-to-a-memory-location-commits-a-page semantics, (iiuc that would be greatly preferred over an explicit commit-a-page operation)
clarify the required initial commit vs reserve state that a Memory should have after creation so that VM implementations can align without needing custom heuristics,
add a wasm API support for decommit()ing a page,
if possible to follow suit with @syg's proposal, add shrink() support [at least] to nonshared Wasm Memories.

That would be the "barebones" type of commit-vs-reserve behavior that would at least solve the mobile problems today.

With respect to shrink, I am convinced that for 64-bit mobile browsers, the absence of shrink will not cause an issue.

Like I mentioned before, I unfortunately don't think I am well positioned to drive a proposal for the feature, but I am still hoping that someone e.g. at Google, Mozilla or Apple closer to implementing and/or maintaining the browser Wasm VMs could be vested to champion to driving this feature?

Though as the developer and maintainer of the Emscripten emmalloc allocator implementation (and relatively close to dlmalloc as well), I'd be eager to coordinate on adding support targeting and validating an implementation into Emscripten (and naturally to Unity).

aardappel commented 2 years ago

I am fine with this reasoning and would support "shrink" functionality as useful, even if single threaded only initially.

That said, playing advocate of the devil on this one:

If all the devices that wasm targets were 64-bit today, we might not care about a shrink() operation - though currently only a small single digit % of Android Chrome browsers in the wild are running 64-bit.

So, a year ago, this article already claims "Nearly 90 percent of today’s Android devices deploy a 64-bit capable version of the OS" (I know, not the same stat, but still) and also "Graphically demanding games such as Epic Games’ Fortnite are already 64-bit only".

Now take it how many years it will take for "shrink" functionality to arrive in browsers, and then developers to release games making use of it, aren't we going to be in an entirely 64-bit world by then?

Your "single digit %" stat, is that what released Unity games are seeing in the wild? Seems strangely low to me.

juj commented 2 years ago

It is definitely true that almost all new sold Android devices are 64-bit (although last time I looked in 2020, there were surprisingly still new 32-bit Android devices coming out in India and Pakistan markets), and we do also see this in our analytics. For the overall marketshare, in our analytics we do not reach a 90% prevalence count, and actually not even a 80% bar.

However the actual issue is that Chrome on Android has not yet completed its 64-bit transition. According to this source, the requirements for getting an automatic update to 64-bit Chrome is to have Android 10 with at least 8GB of RAM on the device. The release announcement in Chromium was from March this year.

According to appbrain.com stats, ~56% of devices are Android 10+ (~10% higher than what we see in our analytics), which looks like a promising trend, but the requirement to have 8GB of RAM is probably a killer. The average gaming phones that are sold to the market in 2021 have just 4GB of RAM. In our analytics, 8GB+ of RAM does not even show up as a meaningful number of users from the huge sea of devices with <= 4GB of RAM.

We have been discussing Chrome's 64bit further rollout plans in our direct calls with Google, though I am unsure to share details just to err on the side of caution. (those are probably public info but could not find a source right now) Based on the above Chromium blog post paired with our analytics, I do not expect 64-bit Chrome overtaking 90% adoption rate within the next three years at least with the current trajectory.