Open monoto opened 3 years ago
Is the code sample small enough, by any chance, to post and examine in a GitHub issue?
Neither embind nor EM_ASM is intended for high-performance interop: frequent jumping between Wasm and JS will hurt performance rather badly. For inter-thread communication, pthread mutexes, task queues, and other synchronization primitives will be fastest.
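As a rough illustration of the mutex and task queue approach, here is a minimal sketch in standard C++. `TaskQueue` and `run_tasks` are hypothetical names made up for this example, not Emscripten APIs, and the sketch assumes a build with pthread support enabled so that `std::thread` is available.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper (not an Emscripten API): a blocking task queue
// guarded by a mutex and a condition variable.
class TaskQueue {
public:
    void push(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        ready_.notify_one();  // wake one waiting worker
    }

    // Blocks until a task is available or the queue is closed.
    // Returns false once the queue is closed and fully drained.
    bool pop(std::function<void()>& task) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !tasks_.empty(); });
        if (tasks_.empty()) return false;
        task = std::move(tasks_.front());
        tasks_.pop();
        return true;
    }

    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();  // let every worker observe the shutdown
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> tasks_;
    bool closed_ = false;
};

// Hypothetical driver: runs `task_count` increment tasks on `worker_count`
// threads and returns the number of tasks that actually ran.
int run_tasks(int worker_count, int task_count) {
    assert(worker_count > 0);
    TaskQueue queue;
    std::mutex total_mutex;
    int total = 0;

    std::vector<std::thread> workers;
    for (int i = 0; i < worker_count; ++i) {
        workers.emplace_back([&queue] {
            std::function<void()> task;
            while (queue.pop(task)) task();
        });
    }
    for (int i = 0; i < task_count; ++i) {
        queue.push([&] {
            std::lock_guard<std::mutex> lock(total_mutex);
            ++total;
        });
    }
    queue.close();
    for (auto& worker : workers) worker.join();
    return total;
}
```

Workers stay inside Wasm for the whole loop, so no Wasm<->JS transition is paid per task; only the final result would need to cross the boundary.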
Polling an atomic flag (via Atomics) from the main thread's requestAnimationFrame callback could cut down on the delay. Let me try that and report back.
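The polling idea can be sketched on the C++ side roughly as follows. All names here (`SharedResult`, `worker_compute`, `poll_ready`, `run_polling_demo`) are hypothetical, and the yield loop is only a stand-in for a per-frame callback; in a browser the flag would sit in the shared Wasm memory, where JS could read it with `Atomics.load`.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>

// Hypothetical names for illustration; none of these come from the issue.
struct SharedResult {
    int value = 0;                   // payload written by the worker
    std::atomic<bool> ready{false};  // completion flag polled by the main thread
};

// Worker side: compute, write the payload, then publish the flag.
void worker_compute(SharedResult& out, int n) {
    int sum = 0;
    for (int i = 1; i <= n; ++i) sum += i;
    out.value = sum;
    // Release store: the write to `value` becomes visible to any thread
    // that later observes `ready == true` with an acquire load.
    out.ready.store(true, std::memory_order_release);
}

// Main-thread side: a non-blocking check, meant to be called once per frame.
bool poll_ready(const SharedResult& result) {
    return result.ready.load(std::memory_order_acquire);
}

// Demo driver: the loop below simulates repeated frame ticks.
int run_polling_demo(int n) {
    assert(n >= 0);
    SharedResult result;
    std::thread worker(worker_compute, std::ref(result), n);
    while (!poll_ready(result)) {
        std::this_thread::yield();  // stand-in for waiting until the next frame
    }
    worker.join();
    return result.value;
}
```

With per-frame polling, the worst-case signaling delay is about one frame interval, and the worker never has to queue a call onto the main thread.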
My test case turned out to be ill-conceived. I launched the threaded test and immediately ran the JavaScript test, so the two were effectively running in parallel and taking resources away from each other. Here is a better result, for 8 million vector and matrix multiplications:
Pure JavaScript: 285 ms; SIMD: 180 ms; SIMD + 8 threads (measured in C++): 55 ms; SIMD + 8 threads (end-to-end in JS): 72 ms.
The overhead from (1) + (2) is consistently around 15-16 ms, which is quite reasonable.
However, one thing worth noting: if the main thread is busy, MAIN_THREAD_ASYNC_EM_ASM can be delayed significantly, as my previous test result indicates.
Unless browser vendors use something like a system interrupt to give MAIN_THREAD_ASYNC_EM_ASM the highest priority on the main thread, I don't see what else Emscripten can do in the meantime.
I created a WebAssembly module for accelerating vector math using SIMD and multi-threading. Multiplying 8 million vector4s by matrix4s reveals that most of the time is spent on (1) JS calling a C++ function exported through embind and (2) a pthread communicating with the main thread using MAIN_THREAD_ASYNC_EM_ASM.
To give you an idea of how much time, here is the result of a sample run:
Pure JavaScript: 304 ms; SIMD: 180 ms; SIMD with 4 threads (measured in C++): 73 ms; SIMD with 8 threads (measured in C++): 50 ms; SIMD with 4 threads (measured in JavaScript, end-to-end): 323 ms.
A significant amount of time (250 ms) is spent on (1) and (2). Is there any current effort to improve the performance of cross-boundary and inter-thread communication?
I know that sharing data through a SharedArrayBuffer is almost instantaneous, but we need a faster way to signal than polling.
Thank you.