bytecodealliance / wasm-micro-runtime

WebAssembly Micro Runtime (WAMR)
Apache License 2.0

[RFC] Assign part of wasm address space to a shared heap to share memory among wasm modules with zero-copying #3546

Open wenyongh opened 3 months ago

wenyongh commented 3 months ago

About the requirement

Many scenarios require sharing a memory buffer between two wasm modules without copying data (zero-copy), and developers have asked about this. But since the wasm spec assumes that a wasm app can only access data inside its own linear memory(ies), this is difficult to achieve: normally we have to copy data from the caller app's linear memory into the callee app's linear memory before calling the callee's function. People may use methods like multi-memory, GC references, or core-module dynamic linking, but these have limitations: toolchain support, the user experience of writing the wasm application, the requirement of advanced wasm features, footprint, and so on. Here we propose a solution: assign part of the wasm address space to a shared heap to share memory among wasm modules with zero copying.

Overview of the solution

As we know, there is an address mapping/conversion between the wasm address space of linear memory and the host address space. For example, in wasm32 the linear memory's address space runs from 0 to linear_mem_size-1, with a maximum range of [0, 4GB-1], and the runtime allocates a corresponding native address range for the linear memory, say from linear_mem_base_addr to linear_mem_base_addr+linear_mem_size-1. The mapping is simple and linear: [0, linear_mem_size-1] in the wasm world <=> [linear_mem_base_addr, linear_mem_base_addr+linear_mem_size-1] in the host world. But since in most cases the max linear memory size is far smaller than 4GB, we can use the higher region of the wasm address space and map it to another runtime-managed heap to share memory among wasm modules (and also host native code).

The idea is mainly to let the runtime create a shared heap for all wasm modules (and host native code): all of them can allocate memory from the shared heap and pass the allocated buffer to other wasm modules and host native code to access. The allocated buffer is mapped into the higher region of the wasm address space: in wasm32, the address space (often called the offset) of a wasm app runs from 0 to 4GB-1 (a relative address, not a native absolute address). Suppose the wasm app's linear memory doesn't use all of that space (it uses 0 to linear_mem_size-1, and normally linear_mem_size is far smaller than 4GB); then the runtime can use the higher region for the shared heap and map the shared heap's native address space into that region, for example from 4GB-shared_heap_size to 4GB-1. The runtime then adjusts the execution of the wasm load/store opcodes: if the offset to access is in the higher region (from 4GB-shared_heap_size to 4GB-1), the runtime converts the offset into a native address in the shared heap; otherwise it converts the offset into a native address in the wasm app's private linear memory. Since the higher region of the wasm address space is the same for all wasm modules and the runtime accesses it in the same way, a wasm module can pass a buffer inside it to another wasm module, sharing the data with zero copying.

And the runtime provides APIs to allocate/free memory from the shared heap: e.g., a wasm app can import functions like (env, shared_malloc) and (env, shared_free) and call them; the imported functions are implemented by the runtime. For host native code, the runtime may provide APIs like wasm_runtime_shared_malloc and wasm_runtime_shared_free. The shared heap size can be specified by the developer during runtime initialization.

From the view of the wasm app, it has two separate address regions. This is not standard behavior per the wasm spec, but it doesn't break the wasm sandbox, since memory access boundary checks can be applied to both regions. There is a performance penalty, since additional boundary checks must be added for the higher region, but I think it should be relatively small and acceptable compared to the buffer-copying approach.

Eventually, when a wasm app wants to share a buffer with another wasm app, the code may look like:

    buffer = shared_malloc(buffer_size);
    write data to buffer;
    call func of other app with buffer as argument
    ...
    shared_free(buffer);


Main changes

Others

WenLY1 commented 3 months ago

Implementation tasks:

yamt commented 3 months ago

do you mean to have a single global shared memory, which all wasm modules with a linear memory on the system can fully access? i guess it would be a bit more useful if you can control it. eg. multiple shared memories, which the embedder can selectively associate to wasm instances.

semantically, is the shared region always treated as if it's a shared memory? eg. does it prevent some possible optimizations like dead load/store eliminations?

woodsmc commented 3 months ago

Awesome. At the W3C in-person event, Deepti presented on providing an mmap function. In her use case this is used to mmap hardware memory into the WASM VM's address space, and could conceivably physically call mmap on the host system. Remapping linear memory to the mmap space was discussed as well, and the idea was to use the lower memory address range as an easily re-mappable memory region with an efficient load/store implementation.

Is it worthwhile checking in with Deepti to ensure there are no clashes? Kinda wondering if we could end up with the lower chunk of linear memory made available to an mmap instruction, and the upper chunk reserved for sharing between modules. It might be nice / important to be able to mmap host memory and share it between modules too?

wenyongh commented 3 months ago

do you mean to have a single global shared memory, which all wasm modules with a linear memory on the system can fully access? i guess it would be a bit more useful if you can control it. eg. multiple shared memories, which the embedder can selectively associate to wasm instances.

Yes, in the current discussion it is supposed that there is only one global shared heap, each wasm module can access it, and the shared heap is created during runtime initialization. Your idea sounds reasonable, but then the runtime should create the shared heaps lazily; the working flow may be like below:

- runtime initialization as normal
- runtime creates shared heap 1
- some instances associate to shared heap 1
- runtime creates shared heap 2
- some instances associate to shared heap 2

For performance considerations, I think we had better restrict each wasm instance to associating with only one shared heap; otherwise it will be too complex and might greatly impact performance. What do you think?

semantically, is the shared region always treated as if it's a shared memory? eg. does it prevent some possible optimizations like dead load/store eliminations?

Yes, the shared region is always mapped to the shared heap, and my suggestion is that we always use a software boundary check for it, since we have to add an extra check for which region the wasm address belongs to; so yes, it would prevent dead load elimination optimizations.

yamt commented 3 months ago

but then runtime should create the shared heap lazily, the working flow may be like below

yes.

if you want, you can still create the first shared heap on the runtime initialization. i don't think it simplifies things much though.

I think we had better restrict that each wasm instance can only associate to one shared heap

i agree.

wenyongh commented 3 months ago

Awesome. At the W3C in-person event, Deepti presented on providing an mmap function. In her use case this is used to mmap hardware memory into the WASM VM's address space, and could conceivably physically call mmap on the host system. Remapping linear memory to the mmap space was discussed as well, and the idea was to use the lower memory address range as an easily re-mappable memory region with an efficient load/store implementation.

Thanks @woodsmc. I don't know how the mmap function is used: does it mean that the wasm app can call mmap, and the runtime maps the mmapped memory into the lower range of the wasm memory address space and then changes the behavior of the wasm load/store opcodes accordingly? And can the wasm app call mmap multiple times? If yes, it may impact performance a lot. And I am not sure why it maps to the lower range of the wasm address space: (1) IIUC, address 0 of wasm is reserved for the C NULL pointer check by clang; clang reserves a space starting at 0 and doesn't put the app's global data at 0; (2) if we map to the lower range, then the wasm app should reserve a relatively large space for it, which may not be so convenient for toolchains/developers; at least for clang, the developer should add the --global-base=n option.

Is it worthwhile checking in with Deepti to ensure there are no clashes? Kinda wondering if we could end up with the lower chunk of linear memory made available to an mmap instruction, and the upper chunk reserved for sharing between modules. It might be nice / important to be able to mmap host memory and share it between modules too?

Yes, it would be great to discuss more with Deepti. I think we should be able to support both mmap and the shared heap if needed, since one uses the lower range of the wasm address space and the other uses the higher range. But maybe we don't need to support mmap when the shared heap is enabled, since the runtime can also use mmap to allocate the memory for the shared heap, or even provide a callback for the developer to allocate the memory.

no1wudi commented 3 months ago

I guess that implementing shared memory using either the shared heap or mmap methods will make memory boundary checks more complex. Therefore, I have an idea about boundary checks in https://github.com/bytecodealliance/wasm-micro-runtime/issues/3548.

Perhaps it could serve as the basis for implementing shared memory functionality. What do you think?

wenyongh commented 3 months ago

@no1wudi yes, maybe we can add another option for wamrc to allow calling a runtime API to do the boundary check in AOT/JIT mode, but I am not sure whether it is good to make it the default mode for the shared-heap/mmap functionality; we had better test the performance first to see the result?

no1wudi commented 3 months ago

@no1wudi yes, maybe we can add another option for wamrc to allow calling a runtime API to do the boundary check in AOT/JIT mode, but I am not sure whether it is good to make it the default mode for the shared-heap/mmap functionality; we had better test the performance first to see the result?

The performance overhead does need to be tested. What I can confirm is that if boundary checks are implemented using an if-else if sequence in LLVM IR, it will significantly increase the code size. In some of our applications, the code size could double as a result.

ayakoakasaka commented 3 months ago

Yes, in the current discussion it is supposed that there is only one global shared heap, each wasm module can access it, and the shared heap is created during runtime initialization. Your idea sounds reasonable, but then the runtime should create the shared heaps lazily; the working flow may be like below:

- runtime initialization as normal
- runtime creates shared heap 1
- some instances associate to shared heap 1
- runtime creates shared heap 2
- some instances associate to shared heap 2

Could this association conform to the principles of a component model in the future? Being able to restrict the accessible area per component (or similar concept) would support a variety of use cases.

wenyongh commented 3 months ago

Yes, in the current discussion it is supposed that there is only one global shared heap, each wasm module can access it, and the shared heap is created during runtime initialization. Your idea sounds reasonable, but then the runtime should create the shared heaps lazily; the working flow may be like below:

- runtime initialization as normal
- runtime creates shared heap 1
- some instances associate to shared heap 1
- runtime creates shared heap 2
- some instances associate to shared heap 2

Could this association conform to the principles of a component model in the future? Being able to restrict the accessible area per component (or similar concept) would support a variety of use cases.

Yes, when the component model is implemented in the future, I think we can also associate some or all of the instances inside a component to a shared heap; for the latter, we may add an API for the component to associate all its instances with a shared heap. It depends on the requirements.

fridaymore commented 2 weeks ago

Would it be possible to have a module default to using shared_malloc / shared_free when making malloc / free calls without rewriting the code?

Example use case: Module A sets up shared data structures in the shared heap and has no need for the private heap. Module B uses both the shared heap and the private heap. Since everything Module A is setting up should go on the shared heap, having it default to using the shared_* methods would make things simpler (especially if we use libraries that malloc/free under the hood).

wenyongh commented 2 weeks ago

Would it be possible to have a module default to using shared_malloc / shared_free when making malloc / free calls without rewriting the code?

Example use case: Module A sets up shared data structures in the shared heap and has no need for the private heap. Module B uses both the shared heap and the private heap. Since everything Module A is setting up should go on the shared heap, having it default to using the shared_* methods would make things simpler (especially if we use libraries that malloc/free under the hood).

Do you mean that in the runtime's module malloc implementation, when it fails to allocate memory from the private heap, the runtime continues to allocate memory from the shared heap?

fridaymore commented 2 weeks ago

Do you mean that in the runtime's module malloc implementation, when it fails to allocate memory from the private heap, the runtime continues to allocate memory from the shared heap?

Sorry, let me try to explain with a concrete example:

In the example below, I am calling malloc in my C function. When compiled to Wasm, it calls emscripten_builtin_malloc. I am assuming that emscripten_builtin_malloc will use the non-shared heap in linear memory. Is there a way to have it use the new shared_malloc function without having to change the DoSomething code?

    #include <emscripten.h>
    ...
    EMSCRIPTEN_KEEPALIVE uint8_t* DoSomething() {
      auto ptr = (uint8_t*) malloc(1);
      *ptr = 1;
      return ptr;
    }

The corresponding wat:

    (func $DoSomething (export "DoSomething") (type $t1) (result i32)
      (local $l0 i32)
      (i32.store8
        (local.tee $l0
          (call $emscripten_builtin_malloc
            (i32.const 1)))
        (i32.const 1))
      (local.get $l0))

wenyongh commented 1 week ago

Hi, a possible way that I can think of is to find which object file implements emscripten_builtin_malloc in emsdk's libc.a and then remove it from libc.a, for example with emar d libc.a xxx.o and emranlib libc.a; refer to pthread_library.md and tensorflow build.sh. Then implement your own emscripten_builtin_malloc native API, register it with the wamr runtime, and let the native API call wasm_runtime_shared_heap_malloc.

fridaymore commented 3 days ago

Thank you for the suggestion. Do you have an estimate for when this feature will be available in the main branch?