Closed: PABannier closed this issue 2 months ago.
The forward pass is taking a very long time. Generating a token with Bark-small is roughly 10x slower than the native C++ build (65 ms vs. 5 ms for the semantic encoder, for instance). I passed the -O3 flag when compiling with Emscripten. I expected the WASM version to be slower than native C++, but not by a factor of 10-20x.
@ggerganov Have you observed something similar with Whisper? Would you happen to have any ideas on speeding up the inference pass?
There are WASM SIMD implementations only for the matrix multiplication op. If there are other ops in Bark that require significant compute, they might become a bottleneck when using WASM.
Overall, WASM performance is not great. For example, I compared the whisper small encoder on my M1 Pro, using only the CPU (i.e. without Metal and without Accelerate CBLAS), and the C++ version is ~10x faster than the web version:
```
WHISPER_NO_METAL=1 WHISPER_NO_ACCELERATE=1 make -j
./bench -m models/ggml-small.bin -t 8

whisper_print_timings: encode time = 1255.23 ms / 1 runs (1255.23 ms per run)
```
The web version is here: https://whisper.ggerganov.com/bench/

```
whisper_print_timings: encode time = 13947.06 ms / 1 runs (13947.06 ms per run)
```
@ggerganov Understood. So using Bark with WASM is probably not the right idea, as the model is still very computationally intensive. I'll focus on supporting Metal and cuBLAS then. Thanks!
Close #154