Pooling allocator: investigate combining separate mmaps into one (unified per-instance) slot

cfallin commented 2 years ago

After some discussion with @alexcrichton just now, the question of why wasmtime's pooling allocator maintains separate pools for memories, tables, and instance structs (and stacks, but those are special on Windows at least) arose.

Especially if we're considering using madvise to flash-clear state and make instantiation faster, it might make sense to investigate whether we can "transpose" the arrangement and put all of an instance's resources in one contiguous region (with appropriate guards of course).

For example, one instance slot could contain all of that instance's tables and memories, and the Instance struct, and perhaps the stack on non-Windows platforms.

The major advantage of this is that we can decommit (flash-clear) the whole region with one madvise syscall. In contrast, today, we need a separate madvise for each resource. This cost is somewhat hidden today because we write ~every byte of the Instance and the tables (with eager init) but if we lazily initialize state in tables and/or Instances (e.g. lazy anyfunc init) this could matter more.

alexcrichton commented 2 years ago

cc @peterhuene, you probably know the reason for separate mmaps here.

One idea is that we could possibly have less fragmentation by putting everything in one since we don't technically need guards between the instance and the tables (in theory at least), although we do for sure needs guards on stacks and on memories.

peterhuene commented 2 years ago

The primary reason for the disjoint instance resources is that it greatly simplified the uffd implementation as we could just monitor a single range for all linear memories rather than multiple discrete parts of a range representing an instance's linear memory pages (or, alternatively, monitor pages meant for tables or stacks).

Indeed, the cost was additional madvise calls to make it simpler. Lucet has the design proposed here where all instance resources remain in a continuous slot (and thus a single madvise call), but its uffd handler has to handle faults in pages not meant for linear memory too.

A single madvise call is definitely preferable and if we're replacing the uffd implementation, I think this makes sense.

bjorn3 commented 2 years ago

Using separate pools would also make it harder to leak pointers (or even cause code to interpret attacker controlled bytes as a pointer) in case of bugs I would guess.

alexcrichton commented 2 years ago

A bit belated but a month or so I did some investigation into this to see what would happen. My findings at the time was that the cost for an madvise call was proportional to the size of the memory that we're madvise-ing, even if that memory wasn't present. My conclusion at the time was that it was better to do the little madvise calls we do today rather than lumping everything into one giant allocation where one madvise call is all that's necessary to blow away.

I wasn't super scientific in my findings, though. Additionally it was also awhile ago I collected data for this and I probably don't have the branch around any more. Wanted to note though that I don't think this is necessarily a slam dunk and before trying to proceed with this (if ever), we'll want to carefully evaluate performance.

bytecodealliance / wasmtime

Pooling allocator: investigate combining separate mmaps into one (unified per-instance) slot #3754