kripken opened this issue 5 years ago
@lukewagner @littledan you expressed some interest in this.
I'd be curious to see some quick numbers about how slow it is to create a new view for every memory access, and for calling into wasm for each memory access too.
As for `view8_s(start, end)`, etc.: this would be identical to `new Int8Array(memory.buffer, start, end - start)`, right? Seems like these functions aren't as useful as the load/store ones.
Not knowing enough about the total design space, I really like having typed loads on the memory object itself. A question I have is: would it be reasonable to put these load methods on ArrayBuffer directly? Are there semantic differences between an ArrayBuffer and a WebAssembly.Memory? It feels like an API inconsistency to me to have these scoped to only wasm.
@jgravelle-google Seems unlikely that ES would want that, since this functionality is already provided by `DataView`.
This proposal sounds good to me, at a high level.
This would be the first time that WebAssembly text format instruction names would show up in JavaScript, AFAIK. Bikeshedding on the load and store API: what if we gave `WebAssembly.Memory` methods of the same name and signature as a `DataView`? Or maybe this doesn't make sense, since these methods should use native endianness, but DataViews are always based on an explicit endianness choice.
I don't understand the view API. Is this about making a TypedArray over the WebAssembly.Memory? Would this TypedArray be expected to remain "live" as memory grows?
> Or, maybe this doesn't make sense, since these methods should use native endianness, but DataViews are always based on an explicit endianness choice.

Oh, that's an interesting point. WebAssembly is always little-endian, but TypedArrays use the platform byte order. So using normal views is already going to give incorrect behavior on big-endian platforms.
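For example (a sketch, with `memory` a `WebAssembly.Memory` and `ptr` an aligned byte offset): a plain view silently uses the platform's byte order, while a DataView lets you request little-endian explicitly:

```js
// Platform-dependent: a Uint32Array reads in the platform's native byte order
// (assumes ptr is 4-byte aligned).
const native = new Uint32Array(memory.buffer, ptr, 1)[0];

// Portable: DataView with an explicit little-endian flag matches wasm's layout.
const dv = new DataView(memory.buffer);
const le = dv.getUint32(ptr, true); // true => little-endian
```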
> Would this TypedArray be expected to remain "live" as memory grows?
I think they're meant to be normal TypedArrays, so their size won't change.
> As for `view8_s(start, end)`, etc.: this would be identical to `new Int8Array(memory.buffer, start, end - start)`, right? Seems like these functions aren't as useful as the load/store ones.
Yeah, good point, the view API here is less useful. I forgot about the typed array constructor with 3 parameters - and even without that it's not a huge win, I guess. It does seem kind of nice for completeness though.
> Would this TypedArray be expected to remain "live" as memory grows?
Yeah, the idea was it would remain live exactly as typed arrays already do (no new features there).
> I'd be curious to see some quick numbers about how slow it is to create a new view for every memory access

Most of the overhead would probably come from the GC cost of creating a new object on each load and store. It's hard to measure that directly since the overhead ends up in GC pauses, is affected by GC optimizations, etc. But this is the exact issue that led to the creation of GC-free entry points for WebGL in WebGL2, which are very useful. (It's potentially even more significant here, since there it saves creating a view per GL call, while here it could be a view per load and store.)
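Concretely, the allocation happens in patterns like this; the proposed methods would avoid it (the `load8_u` method name below is hypothetical, just to illustrate the shape):

```js
// Today: a fresh one-byte view per access creates garbage on every load,
// though it does always see the current buffer, even after memory.grow().
const b = new Uint8Array(memory.buffer, ptr, 1)[0];

// With a load method on the Memory object itself (hypothetical name),
// no JS object is allocated per access:
const b2 = memory.load8_u(ptr);
```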
It is an unfortunate problem to work around. A couple of naive questions:
> Oh, that's an interesting point. WebAssembly is always little-endian, but TypedArrays use the platform byte order. So using normal views is already going to give incorrect behavior on big-endian platforms.
Oh, I didn't realize this. That makes the DataView-analogous API a reasonable interface on the JS side (as long as we're OK with the flip in endianness).
> Yeah, the idea was it would remain live exactly as typed arrays already do (no new features there).
OK, in this case, I don't really understand the purpose of this part of the proposal: it seems like a way to get around a single access to the ArrayBuffer, but it still forces you to allocate a new TypedArray. It's easier to understand why you would want the load/store methods (since that's a bit of indirection and allocation avoided, and it totally solves the growth synchronization problem).
Would it be possible for JS to listen to the memory "change" event, and update any buffers in an event listener there? That seems like it might avoid the view-per-load problem.
You could do this, but it would only help if you make sure you return to the event loop.
Surely JS engines could optimize small wasm functions that act directly on memory and are effectively just load calls, to avoid the full context-switching costs here in the future?
Yes, that seems possible. But it's not trivial -- ultimately you want these to optimize down to a single (perhaps bounds-checked) load, but at least for v8 there's a lot of stuff the optimizer would have to remove to make that happen.
> That makes the DataView-analogous API a reasonable interface on the JS side (as long as we're OK with the flip in endianness).
Ha, I see! The default for DataView is big-endian. Might be a bit confusing if we reuse the naming convention there, but it seems like a small risk to me.
> I'd be curious to see some quick numbers about how slow it is to create a new view for every memory access, and for calling into wasm for each memory access too.

I would also be interested in seeing data comparing a non-growable-memory app (that just uses unchanging cached views) vs. a growable-memory version of the same app that polyfills the abovementioned `Memory` methods by caching the views/buffer and thus guards the load/store methods with `if (cachedBuffer === this.buffer)`. In the common case the buffer doesn't change, so the overhead is just that of calling the `Memory.prototype.buffer` getter, which could itself be inlined (and that would actually be simpler than inlining a family of new load/store methods).
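Roughly, such a guarded polyfill would look like this (a sketch only; the identifier names follow emscripten's conventions elsewhere in this thread, not the exact code in any gist):

```js
let cachedBuffer = memory.buffer;
let HEAP8 = new Int8Array(cachedBuffer);

function load_i8(ptr) {
  // After memory.grow(), the buffer getter returns a new ArrayBuffer,
  // so refresh the cached view before reading.
  if (cachedBuffer !== memory.buffer) {
    cachedBuffer = memory.buffer;
    HEAP8 = new Int8Array(cachedBuffer);
  }
  return HEAP8[ptr];
}
```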
Sure, here are some numbers. The polyfill and benchmark files are in this gist. The benchmark calls emscripten's AsciiToString function repeatedly, which is a function that does many memory reads, so in practice we replace the line

```js
var ch = HEAP8[ptr++ >> 0];
```

with

```js
var ch = GROWABLE_HEAP_LOAD_I8(ptr++ >> 0 | 0) | 0;
```

The latter is 2.9x slower on v8.
As Luke requested, this app doesn't actually grow memory, so a new view is never created; the overhead comes from the function call and the check.
Thanks Alon, that's really helpful. So I think the next step would be to see what the performance would be if the `buffer` accessor were inlined (which is of course much more work, but so is spec'ing and inlining N new `Memory` methods...).
By inlined, do you mean something like

```js
var ch = HEAP8[ptr++ >> 0];
```

becoming

```js
var ch = ((wasmMemory.buffer != buffer ? (buffer = wasmMemory.buffer, updateGlobalBufferViews()) : 0), HEAP8[ptr++ >> 0]);
```

?
With

```js
var ch = ((wasmMemory.buffer != buffer ? (buffer = wasmMemory.buffer, updateGlobalBufferViews()) : 0), HEAP8[ptr++ >> 0]);
```

it's about the same speed, so, still 2.9x slower. (Which I think makes sense, as I assume it gets inlined eventually in such a hot loop.)
Oh, no, I mean the engine-implemented C++ accessor function for `buffer` that is ultimately getting called by the JS engine when you call `.buffer`. It's probably not hard to inline, but I expect no engine has done so yet.
Related: https://github.com/WebAssembly/design/issues/1210 which proposes an event handler.
> Would it be possible for JS to listen to the memory "change" event, and update any buffers in an event listener there? That seems like it might avoid the view-per-load problem.
>
> You could do this, but it would only help if you make sure you return to the event loop.
Couldn't the event be dispatched synchronously?
Unfortunately, in the shared-memory case, the grow can happen completely racily (due to a grow in another thread).
> Oh, no, I mean the engine-implemented C++ accessor function for `buffer` that is ultimately getting called by the JS engine when you call `.buffer`. It's probably not hard to inline, but I expect no engine has done so yet.
Do TypedArray `.length` accessors get inlined? I get much better performance if I build benchmark.cpp from https://gist.github.com/kripken/a9d90e7cfaed9d210aff4fd0be98217d with `GROWABLE_HEAP_LOAD_U8` replaced with this:

```js
function GROWABLE_HEAP_LOAD_U8(ptr) {
  // Only check whether the buffer was replaced when the access falls outside
  // the cached view; after memory.grow() the old view no longer covers the
  // new bytes, so such accesses land here and refresh the views.
  if (ptr >= HEAPU8.length) {
    if (wasmMemory.buffer != buffer) {
      buffer = wasmMemory.buffer;
      updateGlobalBufferViews();
    }
  }
  return HEAPU8[ptr];
}
```
i.e., only pay for the buffer check if the program actually attempts to access memory outside the cached view's current bounds.
I tested on Emscripten 1.38.31, building with:

```
em++ -O2 --std=c++11 -s USE_PTHREADS=1 -s ALLOW_MEMORY_GROWTH=1 -s WASM_MEM_MAX=64MB benchmark.cpp -o benchmark.htm
```

On my system (with Chrome 74) it took about 8 seconds with the existing implementation of `GROWABLE_HEAP_LOAD_U8` and about 2 seconds with my implementation.
I get

```
emcc:WARNING: ignoring unsupported linker flag: --gc-sections
shared:ERROR: invalid forced library: libc++abi
```

when compiling opencv.js.
> I'd be curious to see some quick numbers about how slow it is to create a new view for every memory access, and for calling into wasm for each memory access too.

> I would also be interested in seeing data comparing a non-growable-memory app (that just uses unchanging cached views) vs. a growable-memory version of the same app that polyfills the abovementioned `Memory` methods by caching the views/buffer and thus guards the load/store methods with `if (cachedBuffer === this.buffer)`. In the common case the buffer doesn't change, so the overhead is just that of calling the `Memory.prototype.buffer` getter, which could itself be inlined (and that would actually be simpler than inlining a family of new load/store methods).
If I gather the case correctly, I have encountered that recently. If not, then kindly direct me to a common solution for the problem.
The use case is streaming audio data from the `ReadableStream` of a `fetch()` `Response.body` to a `Uint8Array` backed by a `SharedArrayBuffer` that is posted to an `AudioWorkletProcessor`, and playing back the audio during the process. At Chromium 85 the `Content-Length` header is exposed:
```js
const { headers, body } = await fetch(url, { cache: 'no-store' });
// Firefox does not expose the Content-Length header
// https://bugzilla.mozilla.org/show_bug.cgi?id=1654212
const length = +headers.get('content-length') || 65536 * 4436;
const memory = new WebAssembly.Memory({
  initial: Math.floor(length / 65536) + 1,
  maximum: Math.floor(length / 65536) + 1,
  shared: true,
});
```
where the number of pages can be set at the `Memory()` constructor. Working code: https://github.com/guest271314/AudioWorkletStream/blob/shared-memory-audio-worklet-stream/index.html, save for the necessity to hard-code `maximum`, since the stream could potentially include more than just the first `fetch()` request at some time in the future, to concatenate to the current audio output stream.
We then read data from the `ReadableStream` and write the data to a single `Uint8Array` on the main thread. On the `AudioWorkletGlobalScope` thread we write the data to a `Uint8Array`, pass it to a `Uint16Array`, then create a `Float32Array` for audio output.
Ideally, we should be able to begin with `initial: 1` and use `grow(1)` as the offset increases. However, I have not found a way through testing to achieve that. `RangeError`s are thrown on the `TypedArray` when `grow(1)` is executed in the middle of a read once the `SharedArrayBuffer` is already wrapped in a `Uint8Array`: `TypedArray`s do not dynamically grow when the underlying buffer source grows. Attempting to write data 1 byte at a time to the shared memory, one `Uint8Array` at a time, in both contexts, also throws a `RangeError`.
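A minimal demonstration of the non-growing behavior (assuming a one-page shared memory):

```js
const mem = new WebAssembly.Memory({ initial: 1, maximum: 2, shared: true });
const view = new Uint8Array(mem.buffer);
mem.grow(1);
console.log(view.length);           // still 65536: the existing view did not grow
console.log(mem.buffer.byteLength); // 131072: a fresh view is needed for the new pages
```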
From the first link at the answer to "How are arrays implemented in javascript? What happened to the good old lists?":

> In fact, if you're doing mathematical operations on an array of numbers, consider using a TypedArray. We have specialized elements kinds for those, too.
A potential solution for the above case could be a `TypedArray` that grows dynamically when `grow()` is executed, to avoid the `RangeError` printed at the most recent attempt to dynamically write to and read from shared memory via multiple `TypedArray`s:
```html
<!DOCTYPE html>
<html>
<head>
<title>
fetch() => WAV => ReadableStream => WritableStream => SharedArrayBuffer =>
AudioWorklet
</title>
</head>
<body>
<button>Resume</button><button>Suspend</button>
<script type="worklet" id="smaws">
class SharedMemoryAudioWorkletStream extends AudioWorkletProcessor {
constructor(options) {
super();
Object.assign(this, options.processorOptions);
this.port.onmessage = e => {
this.uint8_sab = e.data;
console.log(sampleRate, currentTime, currentFrame, this.offset, this.length, this.uint8_sab);
};
}
process(inputs, outputs) {
const channels = outputs.flat();
if (this.offset >= this.length) {
console.log(currentTime, currentFrame, this.offset, this.length);
this.port.postMessage('audio worklet stream done');
return false;
}
const uint8 = new Uint8Array(512);
// console.log(uint8);
// return false;
// const buffer =
// console.log(buffer);
// return false;
for (let i = 0; i < 512; i++, this.offset++) {
uint8[i] = new Uint8Array(this.uint8_sab, this.offset, 1)[0]; // third argument is a length (1 element), not an end offset
}
const uint16 = new Uint16Array(uint8.buffer);
// https://stackoverflow.com/a/35248852
for (let i = 0, j = 0, n = 1; i < uint16.length; i++) {
const int = uint16[i];
// If the high bit is on, then it is a negative number, and actually counts backwards.
const float = int >= 0x8000 ? -(0x10000 - int) / 0x8000 : int / 0x7fff;
// interleave
channels[n = ++n % 2][!n ? j++ : j-1] = float;
};
return true;
}
};
registerProcessor(
'shared-memory-audio-worklet-stream',
SharedMemoryAudioWorkletStream
);
</script>
<script>
(async _ => {
try {
if (globalThis.gc) {
gc();
}
const [resume, suspend] = document.querySelectorAll('button');
resume.onclick = async _ => await ac.resume();
suspend.onclick = async _ => await ac.suspend();
const url =
'https://ia800301.us.archive.org/10/items/DELTAnine2013-12-11.WAV/Deltanine121113Pt3Wav.wav';
const worklet = URL.createObjectURL(
new Blob([document.getElementById('smaws').textContent], {
type: 'text/javascript',
})
);
const { headers, body, ok } = await fetch(url, { cache: 'no-store' });
console.log(ok);
const length = +headers.get('content-length');
const ac = new AudioContext({
latencyHint: 1,
sampleRate: 44100,
numberOfChannels: 2,
});
await ac.suspend();
await ac.audioWorklet.addModule(worklet);
const aw = new AudioWorkletNode(
ac,
'shared-memory-audio-worklet-stream',
{
numberOfInputs: 1,
numberOfOutputs: 2,
channelCount: 2,
processorOptions: { length, offset: 0 },
}
);
aw.onprocessorerror = e => {
console.error(e);
console.trace();
};
aw.port.onmessage = async e => {
console.log(e.data);
await ac.suspend();
};
aw.connect(ac.destination);
const page_size = 65536;
let page_byte_count = 0;
const memory = new WebAssembly.Memory({
initial: 1,
maximum: Math.floor(length / page_size) + 1,
shared: true,
});
// const uint8_sab = new Uint8Array(memory.buffer);
aw.port.postMessage(memory.buffer);
let offset = 0;
const reader = body.getReader();
const stream = await (async function process({ value, done }) {
if (done) {
await reader.closed;
return;
}
let len = value.length;
console.log(len);
// let uint8 = new Uint8Array(memory.buffer, offset, value.length);
for (let i = !offset ? 44 : 0; i < value.length; i++) {
const uint8 = new Uint8Array(memory.buffer, offset, 1); // length 1, not offset + 1: the third argument is an element count
uint8[0] = value[i];
++offset;
page_byte_count++;
if (page_byte_count === page_size) {
// uint8 = new Uint8Array(memory.buffer, offset, len - i);
memory.grow(1);
page_byte_count = 0;
}
}
console.log(memory.buffer.byteLength);
// return;
return process(await reader.read());
})(await reader.read());
console.log('read/write done', stream);
/*
await body.pipeTo(
new WritableStream({
write(value) {
for (let i = !offset ? 44 : 0; i < value.length; i++) {
uint8_sab[offset++] = value[i];
}
},
})
);
console.log('read/write done');
*/
} catch (e) {
console.error(e);
throw e;
}
})().catch(console.trace);
</script>
</body>
</html>
```
Or a `TransformStream` where the `readable` side yields `Uint8Array`s or other `TypedArray`s streamed sequentially, backed by a single shared memory instance: a combination of transferable streams (https://docs.google.com/document/d/1_KuZzg5c3pncLJPFa8SuVm23AP4tft6mzPCL5at3I9M/edit; https://bugs.chromium.org/p/chromium/issues/detail?id=894838) and `Memory.grow()`, with each chunk of data in the `TypedArray` being the next offset of the underlying shared memory.
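One way to sketch that idea (a sketch only; `memory` is assumed to be a shared, growable `WebAssembly.Memory` already in scope, and the bookkeeping is simplified):

```js
let writeOffset = 0;
const appendToSharedMemory = new TransformStream({
  transform(chunk, controller) {
    // Grow by however many 64 KiB wasm pages the new bytes require.
    const deficit = writeOffset + chunk.length - memory.buffer.byteLength;
    if (deficit > 0) memory.grow(Math.ceil(deficit / 65536));
    // memory.buffer is re-read after grow(), so this view covers the new pages.
    new Uint8Array(memory.buffer, writeOffset, chunk.length).set(chunk);
    writeOffset += chunk.length;
    controller.enqueue(chunk); // pass the chunk through unchanged
  },
});
// Usage: await body.pipeThrough(appendToSharedMemory).pipeTo(someSink);
```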
Perhaps there is an existing solution for this case; if so, kindly illuminate.
FYI: @syg is proposing a new `ResizableArrayBuffer` and `GrowableSharedArrayBuffer` to help with this problem; see https://github.com/syg/proposal-resizablearraybuffer.
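With those, a single length-tracking view can follow growth. A sketch of the idea (using the options-bag form that growable shared buffers eventually standardized on; the linked explainer's spelling may differ):

```js
// A growable shared buffer with a declared maximum.
const sab = new SharedArrayBuffer(65536, { maxByteLength: 64 * 65536 });
// A view constructed without an explicit length tracks the buffer's size.
const view = new Uint8Array(sab);
sab.grow(2 * 65536);      // grow() takes the new total byte length
console.log(view.length); // 131072: the same view now covers the new bytes
```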
@kripken I was playing around with your benchmark, and I think I came up with an option to reduce the overhead behind the warning `USE_PTHREADS + ALLOW_MEMORY_GROWTH may run non-wasm code slowly`.
Instead of always relying on the array views, have a native function to call into which can cast the memory address; since this is on the wasm side, there is no need to check whether the underlying buffer has changed.
Swapping out AsciiToString with:
```js
function AsciiToString(ptr) {
  var str = '';
  while (1) {
    // _view_uint8 is a wasm export that loads one byte from linear memory,
    // so no JS-side view has to be refreshed after memory growth.
    var ch = Module['_view_uint8'](ptr++ >> 0) >>> 0;
    if (!ch) return str;
    str += String.fromCharCode(ch);
  }
}
```
@arsnyder16 Interesting, thanks, I'm impressed and surprised that calling into wasm is faster here. I would have guessed otherwise. But I guess optimizations on the JS side are hard to predict...
@kripken Any progress on this issue? If it seems like it might not be addressed in the short term, I am wondering if my AsciiToString suggestion would make sense to integrate into emscripten as an improvement. So instead of the GROWABLE_HEAP functions, make a call into a wasm C function?
@arsnyder16 Let's move that topic to an issue on the emscripten tracker? (as this is the wasm design repo, and your idea is an implementation detail) But in general it sounds interesting, if we see clear performance benefits.
@kripken Will do.
Hey! Is there any update on this? :) Thanks!
In the last meeting we briefly discussed this, and it was suggested that I write up more details of the problem and some possible solutions. I did that here:
https://gist.github.com/kripken/949eab99b7bc34f67c12140814d2b595
Feedback welcome!