Returning arrays with snowman-bindings

PoignardAzur commented 5 years ago

After the discussion we had during the July 18th meeting, I'm reconsidering how dynamic-sized values (eg, arrays and strings) should be returned from boundary calls.

The current direction is that, for the following C++ code:

auto array = someWasmModule_getArray(...);
auto someData = array[i];

The C++ module should give its host a malloc-equivalent function at compile-time. Then, at runtime:

1. The C++ module calls someWasmModule_getArray.
2. someWasmModule creates an array internally and returns it.
3. The host calls the malloc function provided by the C++ module.
4. The host makes a byte-by-byte copy of the array from someWasmModule to the address in linear memory allocated by malloc.
5. The host returns that address to the C++ module.
6. The C++ module can then access its copy of the data freely.
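To make the cost of this workflow concrete, here is a minimal sketch of the host-side glue, simulated in plain C++. Everything in it is hypothetical: `LinearMemory`, `malloc_export` and `host_return_array` are illustrative names, the bump allocator stands in for whatever malloc-equivalent the C++ module would export, and a `std::vector` stands in for the array produced by someWasmModule. It is not how any real host implements bindings.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>
#include <vector>

// Hypothetical stand-in for the C++ module's linear memory,
// together with the malloc-equivalent it hands to the host.
struct LinearMemory {
    std::vector<std::uint8_t> bytes;
    std::size_t next = 0;  // next free offset (bump allocation, never freed)

    explicit LinearMemory(std::size_t size) : bytes(size) {}

    // Plays the role of the exported allocator: returns an offset
    // into linear memory, or throws if the memory is exhausted.
    std::size_t malloc_export(std::size_t n) {
        if (n > bytes.size() - next) throw std::bad_alloc();
        std::size_t addr = next;
        next += n;
        return addr;
    }
};

// Hypothetical host glue: allocate in the consumer's memory, copy the
// produced array there byte by byte, and hand back the address.
std::size_t host_return_array(LinearMemory& consumer,
                              const std::vector<std::uint8_t>& produced) {
    std::size_t addr = consumer.malloc_export(produced.size());
    if (!produced.empty()) {
        std::memcpy(consumer.bytes.data() + addr, produced.data(), produced.size());
    }
    return addr;
}
```

Note that the allocation and the full copy happen unconditionally, even if the caller only ever reads a single element afterwards.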

There are some problems with that workflow:

It makes a memcpy in every case. Given that this copy goes to a language-controlled section of linear memory, it might be difficult or impossible for the host to elide that copy away in cases where it would otherwise know the copy is superfluous.
- For instance, in the above code, the C++ module only needs one element of the returned array.
- A function might fetch a string from the DOM, then pass it to a search function, then discard it; in which case, allocating space in linear memory only to discard it immediately afterwards is useless overhead.
There is no obvious way to return an array of references.

While these aren't blocking problems in the short term, in the long term, they might constrain the types of data that can be exchanged between wasm modules and hosts; especially if my recent OCAP bindings proposal gets traction.

So I'm wondering whether it might be worth the cost to bite the bullet and implement first-class variable-sized types in wasm.

By first class, I mean having them as valtypes, that can be stored in stack variables, function arguments, return values, globals and so on.

Doing so would add some complexity to wasm:

An additional generic type
Additional instructions for:
- Creating an array (from values, from tables, from linear memory),
- Indexing, slicing, getting the size,
- Copying, moving.
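As a rough illustration of the surface area this would add, the operations listed above can be modeled as a small C++ value type. This is only a sketch under assumptions: `ValueArray` and its member names are hypothetical and do not correspond to any existing or proposed wasm instruction names, creation from tables is left out, and bounds errors are modeled as C++ exceptions where wasm would presumably trap.

```cpp
#include <cstddef>
#include <initializer_list>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical model of a first-class array value; not a real wasm API.
template <typename T>
class ValueArray {
public:
    ValueArray() = default;

    // Creation from a list of values.
    ValueArray(std::initializer_list<T> values) : items_(values) {}

    // Creation from a region of (simulated) linear memory.
    static ValueArray from_memory(const T* base, std::size_t count) {
        ValueArray a;
        a.items_.assign(base, base + count);
        return a;
    }

    // Getting the size.
    std::size_t size() const { return items_.size(); }

    // Indexing; throws std::out_of_range on a bad index.
    const T& at(std::size_t i) const { return items_.at(i); }

    // Slicing the half-open range [first, last) into a new array.
    ValueArray slice(std::size_t first, std::size_t last) const {
        if (first > last || last > items_.size()) {
            throw std::out_of_range("slice bounds");
        }
        ValueArray a;
        a.items_.assign(items_.begin() + static_cast<std::ptrdiff_t>(first),
                        items_.begin() + static_cast<std::ptrdiff_t>(last));
        return a;
    }

    // Copying and moving use the compiler-generated operations:
    // a copy duplicates every element, a move only transfers ownership.

private:
    std::vector<T> items_;
};
```

A short usage example: `ValueArray<int> a{1, 2, 3, 4, 5};` followed by `a.slice(1, 4)` yields a three-element array holding 2, 3, 4, while `std::move(a)` hands the storage to a new owner without touching the elements.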

I'm not familiar with the internals of the big wasm VMs. How big of a cost is this? How hard a sell would it be as an addition to the spec?

I think it's not overly complex, semantically. We're not talking about generics or monads here. Every instruction could still be validated in O(1) time (though array copying would be O(n) at runtime).

It would be, essentially, splitting off another part of the GC proposal, and implementing it as a stack-only feature, to give the host more information easily accessible to the compiler in some cases.

What do you think?

lukewagner commented 5 years ago

IIUC, what you want would be satisfied by using a wasm GC arraytype.

PoignardAzur commented 5 years ago

> IIUC, what you want would be satisfied by using a wasm GC arraytype.

What I'm proposing is for non-GC array types.

PoignardAzur commented 5 years ago

To expand, I'm mostly looking at this from two perspectives:

1: Minimizing overhead in common situations where the host compiler can elide copies if it has enough info to do so.

Copying information to the linear-memory prevents any optimization, because the host can't trivially know whether the slice of linear memory used is going to be reused later, even if the language compiler knows that memory passed to free is discarded.

Allocating a GC array is easier to optimize, because the host can perform escape analysis and delete the reference early, or even skip its allocation, but it's more expensive when not optimized: GC allocations can trigger collections, tend to spread data around and increase cache misses, etc. More importantly, not all hosts and languages are GC-compatible or want to be.

What I'm proposing is having a first-class, pass-around-by-copy-or-move array type that can be allocated and stored on the wasm stack. That type makes it easier to elide copies in the use cases described above; and, in cases where a function creates an array where wasm can determine its size statically, its contents can be allocated directly on the stack, which maximizes data locality and minimizes allocation costs.
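The closest existing analogy I can think of is returning a statically sized array by value in C++. The sketch below reuses the shape of the snippet from the first post; `someModule_getArray` and `read_element` are hypothetical names, and `std::array` only stands in for the proposed type, whose actual layout and semantics are not specified anywhere.

```cpp
#include <array>
#include <cstddef>

// Hypothetical producer: the size (4) is known statically, so the
// returned value has inline storage and needs no heap allocation.
std::array<int, 4> someModule_getArray() {
    return std::array<int, 4>{{10, 20, 30, 40}};
}

// Consumer that only needs one element of the returned array.
int read_element(std::size_t i) {
    // The array is a local value: no allocator is passed around,
    // and its whole lifetime is visible inside this function.
    auto array = someModule_getArray();
    return array.at(i);  // bounds-checked access
}
```

Because the value never leaves the caller's frame, whoever compiles this has the information needed to decide whether the unused elements must be materialized at all, which is the kind of visibility the linear-memory workflow does not give the host.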

The downside is that it might encourage languages to pass copies of arrays in cases where users would expect them to pass references.

2: Allowing the compiler to express type info with better granularity.

There's been some debate around the implementation of array types in webidl/snowperson-bindings; in particular around cases where, when binding incoming binding expressions, wasm needs some way to reserve memory for variable-size types.

@fgmccabe pointed out that giving external libraries access to an allocator function gives them an undesirable level of control on your module's internal memory. While this isn't inherently unsafe, it could be a possible attack vector combined with other vulnerabilities.

I pointed out above the problem of returning arrays of opaque types (eg calling myFolder.getFileList()).

More generally, if we want an interop ecosystem, I think wasm should give compilers the power to express what type of data they're manipulating. It's not super important when dealing with a single monolithic program that manages its own block of memory, but it becomes important when communicating between untrusted modules.

If you give compiler A the ability to express "I'm exporting a function that returns an array" as a first-class type, not just an abstract binding, then compiler B can handle that type directly, and read it or allocate memory for it without passing allocators to the host or the other module; compiler B has more discretion, the allocation happens within its own control flow, and the host has more information if it wants to inline and optimize the call.

lukewagner commented 5 years ago

If you want a first-class array-typed value, then it needs to go in the core wasm language; the scope of the bindings proposal is only that which can be done "at the boundary", hence binding-type-arrays ultimately ending up in linear memory or gc memory, b/c those are the two things core wasm has first-class access to.

PoignardAzur commented 5 years ago

I know. I'm just posting this here because foreign bindings would be the n°1 use case for such a feature.

Mostly I'm wondering if that use case is convincing enough to justify making a proposal.

PoignardAzur commented 5 years ago

From the discussion at the July 25 chat, it came out that a first-class array would be a lot of political/technical effort to add to the spec, and it's dubious that C++ (the primary target for such a feature) would get much from it, because LLVM doesn't really know how to interpret non-linear-memory types.

So I'm closing this issue for now, though I'm still interested in discussing slice-passing semantics further.

WebAssembly / interface-types

Returning arrays with snowman-bindings #49