WebAssembly/design: WebAssembly Design Documents

Threads + memory growth + JS #1271

Open kripken opened 5 years ago

kripken commented 5 years ago

In the last meeting we briefly discussed this, and it was suggested that I write up more details of the problem and some possible solutions. I did that here:

https://gist.github.com/kripken/949eab99b7bc34f67c12140814d2b595

Feedback welcome!

binji commented 5 years ago

@lukewagner @littledan you expressed some interest in this.

I'd be curious to see some quick numbers about how slow it is to create a new view for every memory access, and for calling into wasm for each memory access too.

As for view8_s(start, end), etc. This would be identical to new Int8Array(memory.buffer, start, end - start), right? Seems like these functions aren't as useful as the load/store ones.

jgravelle-google commented 5 years ago

Not knowing enough about the total design space, I really like having typed loads on the memory object itself. A question I have is, would it be reasonable to put these load methods on ArrayBuffer directly? Are there semantic differences between an ArrayBuffer and a WebAssembly.Memory? It feels like an API inconsistency to me to have these scoped to only wasm.

binji commented 5 years ago

@jgravelle-google Seems unlikely that ES would want that, since this functionality is already provided by DataView.

littledan commented 5 years ago

This proposal sounds good to me, at a high level.

This would be the first time that WebAssembly text format instruction names would show up in JavaScript, AFAIK. Bikeshedding on the load and store API: what if we gave WebAssembly.Memory methods with the same names and signatures as DataView's? Or, maybe this doesn't make sense, since these methods should use native endianness, but DataViews are always based on an explicit endianness choice.

I don't understand the view API. Is this about making a TypedArray over the WebAssembly.Memory? Would this TypedArray be expected to remain "live" as memory grows?

binji commented 5 years ago

> Or, maybe this doesn't make sense, since these methods should use native endianness, but DataViews are always based on an explicit endianness choice.

Oh, that's an interesting point. WebAssembly is always little-endian, but TypedArrays are platform byte order. So using normal views is already going to be incorrect behavior.
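
For illustration, a quick way to observe the difference (a minimal sketch; the hex constant is arbitrary):

```js
// DataView lets you pin the byte order; TypedArray element access
// always uses the platform's byte order.
const buf = new ArrayBuffer(4);
new DataView(buf).setUint32(0, 0x11223344, /* littleEndian = */ true);
// Prints "11223344" on a little-endian platform; a big-endian platform
// would print "44332211".
console.log(new Uint32Array(buf)[0].toString(16));
```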

> Would this TypedArray be expected to remain "live" as memory grows?

I think they're meant to be normal TypedArrays, so their size won't change.

kripken commented 5 years ago

> As for view8_s(start, end), etc. This would be identical to new Int8Array(memory.buffer, start, end - start), right? Seems like these functions aren't as useful as the load/store ones.

Yeah, good point, the view API here is less useful. I forgot about the typed array constructor with 3 parameters - and even without that it's not a huge win, I guess. It does seem kind of nice for completeness though.

> Would this TypedArray be expected to remain "live" as memory grows?

Yeah, the idea was it would remain live exactly as typed arrays already do (no new features there).

> I'd be curious to see some quick numbers about how slow it is to create a new view for every memory access

Most of the overhead would probably come from the GC cost of creating a new object on each load and store. It's hard to measure that directly, since the overhead ends up in GC pauses, is affected by GC optimizations, etc. But this is the exact issue that led to the creation of GC-free entry points for WebGL in WebGL2, which are very useful. (It's potentially even more significant here: there it saves creating a view per GL call, while here it could be a view per load and store.)
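
For comparison, a sketch of the WebGL2 pattern being referred to; gl, HEAPF32, byteOffset, and count are assumed context here, not part of any proposal:

```js
// Sketch: the WebGL2 overload avoids allocating a subview per call.
// gl is a WebGL2RenderingContext; HEAPF32 is a cached Float32Array view
// over the whole wasm memory (Emscripten-style).
function uploadVertices(gl, HEAPF32, byteOffset, count) {
  // WebGL1 style: a fresh Float32Array view is created (and GC'd) per call:
  // gl.bufferSubData(gl.ARRAY_BUFFER, 0,
  //                  new Float32Array(HEAPF32.buffer, byteOffset, count));
  // WebGL2 style: pass the full cached view plus an element offset instead.
  gl.bufferSubData(gl.ARRAY_BUFFER, 0, HEAPF32, byteOffset / 4, count);
}
```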

guybedford commented 5 years ago

It is an unfortunate problem to work around. A couple of naive questions:

  1. Would it be possible for JS to listen to the memory "change" event, and update any buffers in an event listener there? That seems like it might avoid the view-per-load problem.
  2. Surely JS engines could, in future, optimize small wasm functions that act directly on memory and are effectively just load calls, to avoid the full context-switching costs here?
littledan commented 5 years ago

> Oh, that's an interesting point. WebAssembly is always little-endian, but TypedArrays are platform byte order. So using normal views is already going to be incorrect behavior.

Oh, I didn't realize this. That makes the DataView-analogous API a reasonable interface on the JS side (as long as we're OK with the flip in endianness).

> Yeah, the idea was it would remain live exactly as typed arrays already do (no new features there).

OK, in this case I don't really understand the purpose of this part of the proposal: it seems like a way to get around a single access to the ArrayBuffer, but it still forces you to allocate a new TypedArray. It's easier to understand why you would want the load/store methods (since that's a bit of indirection and allocation avoided, and it totally solves the growth synchronization problem).

binji commented 5 years ago

> Would it be possible for JS to listen to the memory "change" event, and update any buffers in an event listener there? That seems like it might avoid the view-per-load problem.

You could do this, but it would only help if you make sure you return to the event loop.

> Surely JS engines could, in future, optimize small wasm functions that act directly on memory and are effectively just load calls, to avoid the full context-switching costs here?

Yes, that seems possible, but it's not trivial: ultimately you want these to optimize down to a single (perhaps bounds-checked) load, but at least for v8 there's a lot of machinery the optimizer would have to remove to make that happen.

> That makes the DataView-analogous API a reasonable interface on the JS side (as long as we're OK with the flip in endianness).

Ha, I see! The default for DataView is big-endian. Might be a bit confusing if we reuse the naming convention there, but it seems like a small risk to me.

lukewagner commented 5 years ago

> I'd be curious to see some quick numbers about how slow it is to create a new view for every memory access, and for calling into wasm for each memory access too.

It would also be interesting to see data comparing a non-growable-memory app (that just uses unchanging cached views) vs. a growable-memory version of the same app that polyfills the abovementioned Memory methods by caching the views/buffer and thus guards the load/store methods with if (cachedBuffer === this.buffer).

In the common case the buffer doesn't change, so the overhead is just that of calling the Memory.prototype.buffer getter, which could itself be inlined (and that would actually be simpler than inlining a family of new load/store methods).
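
A minimal sketch of that guarded polyfill, assuming an Emscripten-style wasmMemory and a cached HEAP8 view (the names are illustrative, not proposed API):

```js
// Guard each access on buffer identity; growing replaces the buffer
// object, so an identity check detects when views must be refreshed.
let cachedBuffer = wasmMemory.buffer;
let HEAP8 = new Int8Array(cachedBuffer);

function loadI8(ptr) {
  if (wasmMemory.buffer !== cachedBuffer) { // common case: no growth
    cachedBuffer = wasmMemory.buffer;
    HEAP8 = new Int8Array(cachedBuffer);
  }
  return HEAP8[ptr];
}
```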

kripken commented 5 years ago

Sure, here are some numbers. The polyfill and benchmark files are in this gist. The benchmark calls emscripten's AsciiToString function repeatedly, which is a function that does many memory reads, so in practice we replace the line

  var ch = HEAP8[ptr++ >> 0];

with

  var ch = GROWABLE_HEAP_LOAD_I8(ptr++ >> 0 | 0) | 0;

The latter is 2.9x slower on v8.

As Luke requested, this app doesn't actually grow memory, so a new view is never created; the overhead comes from the function call and the check.

lukewagner commented 5 years ago

Thanks Alon, that's really helpful. So I think the next step would be to see what the performance was like if the buffer accessor was inlined (which is of course much more work, but so is spec'ing and inlining N new Memory methods...).

kripken commented 5 years ago

By inlined, do you mean something like

  var ch = HEAP8[ptr++ >> 0];
==>
  var ch = (wasmMemory.buffer != buffer ? (buffer = wasmMemory.buffer, updateGlobalBufferViews()) : 0, HEAP8[ptr++ >> 0]);

?

kripken commented 5 years ago

With

  var ch = ((wasmMemory.buffer != buffer ? (buffer = wasmMemory.buffer, updateGlobalBufferViews()) : 0), HEAP8[ptr++ >> 0]);

it's about the same speed, so, still 2.9x slower. (Which I think makes sense as I assume it inlines eventually in such a hot loop.)

lukewagner commented 5 years ago

Oh, no, I mean the engine-implemented C++ accessor function for buffer that is ultimately getting called by the JS engine when you call .buffer. It's probably not hard to inline, but I expect no engine has done so yet.

dcodeIO commented 5 years ago

Related: https://github.com/WebAssembly/design/issues/1210 which proposes an event handler.

Macil commented 5 years ago

> Would it be possible for JS to listen to the memory "change" event, and update any buffers in an event listener there? That seems like it might avoid the view-per-load problem.

> You could do this, but it would only help if you make sure you return to the event loop.

Couldn't the event be dispatched synchronously?

lukewagner commented 5 years ago

Unfortunately, in the shared-memory case the grow can happen completely racily, because another thread can call grow at any time.
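
A sketch of why no synchronous notification can be complete here (the worker code is hypothetical):

```js
// worker.js (hypothetical): nothing on the main thread can observe this
// grow synchronously, so any cached view can go stale at any moment.
onmessage = ({ data: memory }) => {
  memory.grow(1); // may run between any two accesses on another thread
};
```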

cosinusoidally commented 5 years ago

> Oh, no, I mean the engine-implemented C++ accessor function for buffer that is ultimately getting called by the JS engine when you call .buffer. It's probably not hard to inline, but I expect no engine has done so yet.

Do TypedArray .length accessors get inlined? I get much better performance if I build benchmark.cpp from https://gist.github.com/kripken/a9d90e7cfaed9d210aff4fd0be98217d with GROWABLE_HEAP_LOAD_U8 replaced with this:

function GROWABLE_HEAP_LOAD_U8(ptr) {
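  // Only refresh the cached views when the access falls outside the
  // current view's bounds, i.e. when the memory may have grown.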
  if (ptr>=HEAPU8.length) {
    if (wasmMemory.buffer != buffer) {
      buffer = wasmMemory.buffer;
      updateGlobalBufferViews();
    }
  } 
  return HEAPU8[ptr];
}

i.e. only check whether the views need refreshing when the program actually attempts an access beyond the cached view's bounds. (For a shared memory this should be sound, since growing never moves the underlying memory: a stale view still aliases the same bytes within its old bounds, and only accesses past the old length need a refreshed view.)

I tested on Emscripten 1.38.31 building with:

em++ -O2 --std=c++11 -s USE_PTHREADS=1 -s ALLOW_MEMORY_GROWTH=1 -s WASM_MEM_MAX=64MB benchmark.cpp -o benchmark.htm

On my system (with Chrome 74) it took about 8 seconds with the existing implementation of GROWABLE_HEAP_LOAD_U8 and about 2 seconds with my implementation.

JsBlueCat commented 5 years ago

  emcc:WARNING: ignoring unsupported linker flag: --gc-sections
  shared:ERROR: invalid forced library: libc++abi

when compiling opencv.js

guest271314 commented 4 years ago

> I'd be curious to see some quick numbers about how slow it is to create a new view for every memory access, and for calling into wasm for each memory access too.

> It would also be interesting to see data comparing a non-growable-memory app (that just uses unchanging cached views) vs. a growable-memory version of the same app that polyfills the abovementioned Memory methods by caching the views/buffer and thus guards the load/store methods with if (cachedBuffer === this.buffer).

> In the common case the buffer doesn't change, so the overhead is just that of calling the Memory.prototype.buffer getter, which could itself be inlined (and that would actually be simpler than inlining a family of new load/store methods).

If I understand the case correctly, I have encountered this recently. If not, kindly direct me to a common solution for the problem.

The use case is streaming audio data from a fetch() Response.body ReadableStream into a Uint8Array backed by a SharedArrayBuffer, which is posted to an AudioWorkletProcessor, with the audio playing back while the stream is still being written.

As of Chromium 85 the Content-Length header is exposed:

      const { headers, body } = await fetch(url, { cache: 'no-store' });
      // Firefox does not expose the Content-Length header
      // https://bugzilla.mozilla.org/show_bug.cgi?id=1654212
      const length = +headers.get('content-length') || 65536 * 4436;
      const memory = new WebAssembly.Memory({
        initial: Math.floor(length / 65536) + 1,
        maximum: Math.floor(length / 65536) + 1,
        shared: true,
      });

where the number of pages can be set at the Memory() constructor. Working code: https://github.com/guest271314/AudioWorkletStream/blob/shared-memory-audio-worklet-stream/index.html. This works, save for the necessity to hard-code maximum, even though the stream could potentially, at some time in the future, include more than just the first fetch() request, concatenated onto the current audio output stream.

We then read data from the ReadableStream and write it to a single Uint8Array on the main thread. On the AudioWorkletGlobalScope thread we read the data into a Uint8Array, reinterpret it as a Uint16Array, then convert to Float32Array samples for audio output.

Ideally, we should be able to begin with initial: 1 and use grow(1) as the offset increases. However, I have not found a way through testing to achieve that. RangeErrors are thrown on the TypedArray when grow(1) is executed in the middle of a read, once a Uint8Array is already set over the SharedArrayBuffer: TypedArrays do not dynamically grow when the underlying buffer grows. Attempting to write data one byte at a time to the shared memory, one Uint8Array at a time, in both contexts, also throws a RangeError.

From the first link in the answer to "How are arrays implemented in JavaScript? What happened to the good old lists?":

> In fact, if you’re doing mathematical operations on an array of numbers, consider using a TypedArray. We have specialized elements kinds for those, too.

A potential solution for the above case could be a TypedArray that grows dynamically when grow() is executed, avoiding the RangeError printed by the most recent attempt to dynamically write to and read from shared memory via multiple TypedArrays:

<!DOCTYPE html>

<html>
  <head>
    <title>
      fetch() => WAV => ReadableStream => WritableStream => SharedArrayBuffer =>
      AudioWorklet
    </title>
  </head>

  <body>
    <button>Resume</button><button>Suspend</button>
    <script type="worklet" id="smaws">
      class SharedMemoryAudioWorkletStream extends AudioWorkletProcessor {
        constructor(options) {
          super();
          Object.assign(this, options.processorOptions);
          this.port.onmessage = e => {
            this.uint8_sab = e.data;
            console.log(sampleRate, currentTime, currentFrame, this.offset, this.length, this.uint8_sab);
          };
        }
        process(inputs, outputs) {
          const channels = outputs.flat();
          if (this.offset >= this.length) {
            console.log(currentTime, currentFrame, this.offset, this.length);
            this.port.postMessage('audio worklet stream done');
            return false;
          }
          const uint8 = new Uint8Array(512);
          // console.log(uint8);
          // return false;
          // const buffer =
          // console.log(buffer);
          // return false;
          for (let i = 0; i < 512; i++, this.offset++) {
            uint8[i] = new Uint8Array(this.uint8_sab, this.offset, 1)[0]; // 3rd argument is a length (1 element), not an end offset
          }
          const uint16 = new Uint16Array(uint8.buffer);
          // https://stackoverflow.com/a/35248852
          for (let i = 0, j = 0, n = 1; i < uint16.length; i++) {
            const int = uint16[i];
            // If the high bit is on, then it is a negative number, and actually counts backwards.
            const float = int >= 0x8000 ? -(0x10000 - int) / 0x8000 : int / 0x7fff;
            // interleave
            channels[n = ++n % 2][!n ? j++ : j-1] = float;
          };
          return true;
        }
      };
      registerProcessor(
        'shared-memory-audio-worklet-stream',
        SharedMemoryAudioWorkletStream
      );
    </script>
    <script>
      (async _ => {
        try {
          if (globalThis.gc) {
            gc();
          }
          const [resume, suspend] = document.querySelectorAll('button');
          resume.onclick = async _ => await ac.resume();
          suspend.onclick = async _ => await ac.suspend();
          const url =
            'https://ia800301.us.archive.org/10/items/DELTAnine2013-12-11.WAV/Deltanine121113Pt3Wav.wav';
          const worklet = URL.createObjectURL(
            new Blob([document.getElementById('smaws').textContent], {
              type: 'text/javascript',
            })
          );
          const { headers, body, ok } = await fetch(url, { cache: 'no-store' });
          console.log(ok);
          const length = +headers.get('content-length');
          const ac = new AudioContext({
            latencyHint: 1,
            sampleRate: 44100,
            numberOfChannels: 2,
          });
          await ac.suspend();
          await ac.audioWorklet.addModule(worklet);
          const aw = new AudioWorkletNode(
            ac,
            'shared-memory-audio-worklet-stream',
            {
              numberOfInputs: 1,
              numberOfOutputs: 2,
              channelCount: 2,
              processorOptions: { length, offset: 0 },
            }
          );
          aw.onprocessorerror = e => {
            console.error(e);
            console.trace();
          };
          aw.port.onmessage = async e => {
            console.log(e.data);
            await ac.suspend();
          };
          aw.connect(ac.destination);
          const page_size = 65536;
          let page_byte_count = 0;
          const memory = new WebAssembly.Memory({
            initial: 1,
            maximum: Math.floor(length / page_size) + 1,
            shared: true,
          });
          // const uint8_sab = new Uint8Array(memory.buffer);
          aw.port.postMessage(memory.buffer);
          let offset = 0;
          const reader = body.getReader();
          const stream = await (async function process({ value, done }) {
            if (done) {
              await reader.closed;
              return;
            }
            let len = value.length;
            console.log(len);
            // let uint8 = new Uint8Array(memory.buffer, offset, value.length);
            for (let i = !offset ? 44 : 0; i < value.length; i++) {
              const uint8 = new Uint8Array(memory.buffer, offset, 1); // 3rd argument is a length (1 element), not an end offset
              uint8[0] = value[i];
              ++offset;
              page_byte_count++;
              if (page_byte_count === page_size) {
                // uint8 = new Uint8Array(memory.buffer, offset, len - i);
                memory.grow(1);
                page_byte_count = 0;
              }
            }

            console.log(memory.buffer.byteLength);
            // return;
            return process(await reader.read());
          })(await reader.read());

          console.log('read/write done', stream);
          /*
          await body.pipeTo(
            new WritableStream({
              write(value) {
                for (let i = !offset ? 44 : 0; i < value.length; i++) {
                  uint8_sab[offset++] = value[i];
                }
              },
            })
          );
          console.log('read/write done');
          */
        } catch (e) {
          console.error(e);
          throw e;
        }
      })().catch(console.trace);
    </script>
  </body>
</html>

Alternatively, a TransformStream where each chunk on the readable side is a Uint8Array or other TypedArray streamed sequentially, backed by a single shared memory instance: a combination of transferable streams (https://docs.google.com/document/d/1_KuZzg5c3pncLJPFa8SuVm23AP4tft6mzPCL5at3I9M/edit; https://bugs.chromium.org/p/chromium/issues/detail?id=894838) and Memory.grow(), with each chunk of data in the TypedArray being the next offset into the underlying shared memory.
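
A rough sketch of that TransformStream idea (illustrative only; `memory` is assumed to be a shared WebAssembly.Memory grown elsewhere before each region is written):

```js
let offset = 0;
const toSharedViews = new TransformStream({
  transform(chunk, controller) {
    // Copy the chunk into the next region of shared memory, then enqueue
    // a view over just that region for the consumer.
    const view = new Uint8Array(memory.buffer, offset, chunk.length);
    view.set(chunk);
    offset += chunk.length;
    controller.enqueue(view);
  },
});
// Usage (sketch): response.body.pipeThrough(toSharedViews).pipeTo(...)
```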

Perhaps there is an existing solution for this case; if so, kindly illuminate.

binji commented 4 years ago

FYI: @syg is proposing a new ResizableArrayBuffer and GrowableSharedArrayBuffer to help with this problem, see https://github.com/syg/proposal-resizablearraybuffer.
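
For reference, a minimal sketch of how that proposal's length-tracking views would address the RangeError problem above (API names as in the proposal README at the time, not shipped APIs):

```js
// Sketch based on the proposal README; the growable variant was later
// folded into SharedArrayBuffer itself.
const gsab = new GrowableSharedArrayBuffer(1024, 8 * 65536);
const view = new Uint8Array(gsab); // no explicit length: length-tracking
console.log(view.byteLength);      // 1024
gsab.grow(2048);                   // grow to a new byte length
console.log(view.byteLength);      // 2048: the view tracks the buffer
```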

arsnyder16 commented 2 years ago

@kripken I was playing around with your benchmark, and I think I came up with an option to reduce the overhead behind the warning USE_PTHREADS + ALLOW_MEMORY_GROWTH may run non-wasm code slowly.

Instead of always relying on the array views, have a native function to call into that can cast the memory address; since this is on the wasm side, there is no need to check whether the underlying buffer has changed.

Here is the new benchmark.cpp:

```cpp
#include <emscripten.h>
#include <chrono>
#include <cstdint>
#include <iostream>

extern "C" {
EMSCRIPTEN_KEEPALIVE uint8_t view_uint8(void* address) {
  return reinterpret_cast<uint8_t*>(address)[0];
}
}

int main() {
  auto start = std::chrono::steady_clock::now();
  size_t total = 0;
  for (const char* str : { "hello", "world", "!", "test", "ing", "strings", "many", "of", "them" }) {
    total += EM_ASM_INT({
      var total = 0;
      var value;
      for (var i = 0; i < 1000000; ++i) {
        value = AsciiToString($0);
        total += value.length;
      }
      console.log(value);
      return total;
    }, str);
  }
  auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                std::chrono::steady_clock::now() - start).count();
  std::cout << total << " in " << ms << " ms " << total / ms << " /ms\n";
}
```
NO_ALLOW_MEMORY_GROWTH:

```bash
~/emscripten/benchmark ((3.1.7))# em++ -Os -pthread -sNO_ALLOW_MEMORY_GROWTH -o benchmark.js benchmark.cpp
~/emscripten/benchmark ((3.1.7))# node --experimental-wasm-threads --experimental-wasm-bulk-memory benchmark.js
hello
world
!
test
ing
strings
many
of
them
35000000 in 293 ms 119453 /ms
```

ALLOW_MEMORY_GROWTH:

```bash
~/emscripten/benchmark ((3.1.7))# em++ -Os -pthread -sALLOW_MEMORY_GROWTH -o benchmark.js benchmark.cpp
em++: warning: USE_PTHREADS + ALLOW_MEMORY_GROWTH may run non-wasm code slowly, see https://github.com/WebAssembly/design/issues/1271 [-Wpthreads-mem-growth]
~/emscripten/benchmark ((3.1.7))# node --experimental-wasm-threads --experimental-wasm-bulk-memory benchmark.js
hello
world
!
test
ing
strings
many
of
them
35000000 in 1328 ms 26355 /ms
```

Swapping out AsciiToString with:

function AsciiToString(ptr){
  var str="";
  while(1) {
    var ch=Module['_view_uint8'](ptr++>>0) >>> 0;
    if(!ch) return str;
    str+=String.fromCharCode(ch)
  }
}
Results:

```bash
~/emscripten/benchmark ((3.1.7))# node --experimental-wasm-threads --experimental-wasm-bulk-memory benchmark.js
hello
world
!
test
ing
strings
many
of
them
35000000 in 408 ms 85784 /ms
```
kripken commented 2 years ago

@arsnyder16 Interesting, thanks, I'm impressed and surprised that calling into wasm is faster here. I would have guessed otherwise. But I guess optimizations on the JS side are hard to predict...

arsnyder16 commented 1 year ago

@kripken Any progress on this issue? It seems like it might not be addressed in the short term.

I am wondering if my AsciiToString suggestion would make sense to integrate into emscripten as an improvement: instead of the GROWABLE_HEAP functions, make a call into a wasm C function?
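
Something like this hypothetical shape (names illustrative; not existing Emscripten API):

```js
// Hypothetical sketch: the JS-side accessor delegates to a tiny wasm
// export, so no view creation or buffer-identity check is needed in JS.
function GROWABLE_HEAP_LOAD_I8(ptr) {
  return wasmExports['view_int8'](ptr); // wasm side: return *(int8_t*)ptr;
}
```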

kripken commented 1 year ago

@arsnyder16 Let's move that topic to an issue on the emscripten tracker (as this is the wasm design repo, and your idea is an implementation detail). But in general it sounds interesting, if we see clear performance benefits.

arsnyder16 commented 1 year ago

@kripken Will do.

ghnp5 commented 2 weeks ago

Hey! Is there any update on this? :) Thanks!