Closed: loretoparisi closed this issue 6 months ago.
Without https://github.com/WebAssembly/memory64 implemented in WebAssembly, you are going to run into show-stopping memory issues with the current 4GB limit due to 32-bit addressing.
Do you have a plan for getting around that?
If you quantized the 7B model to a mixture of 3-bit and 4-bit weights using https://github.com/qwopqwop200/GPTQ-for-LLaMa, then you could stay within that memory envelope.
I think that's a reasonable proposal @Dicklesworthstone.
A purely 3-bit implementation of llama.cpp using GPTQ could retain acceptable performance and solve the same memory issues. There's an open issue for implementing GPTQ quantization in 3-bit and 4-bit: GPTQ Quantization (3-bit and 4-bit) #9.
Other use cases could benefit from this same enhancement, such as getting 65B under 32GB and 30B under 16GB to further extend access to (perhaps slightly weaker versions of) the larger models.
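As a rough sanity check on those targets, here is a back-of-the-envelope sketch (my own illustration, not from the thread) of the weight memory needed at different bit widths; it ignores the KV cache, activations, and per-block scale overhead, so real figures are somewhat higher:

#include <cstdio>

int main() {
    // Parameter counts in billions and candidate quantization bit widths.
    const double params_b[] = {7, 13, 30, 65};
    const int bits[] = {3, 4};
    for (double p : params_b) {
        for (int b : bits) {
            // bytes = params * bits / 8; report in GiB.
            double gib = p * 1e9 * b / 8.0 / (1024.0 * 1024.0 * 1024.0);
            std::printf("%5.0fB @ %d-bit ~ %5.1f GiB of weights\n", p, b, gib);
        }
    }
    return 0;
}

With these rough numbers, 7B at 3-4 bits sits near the 4GB 32-bit ceiling, 30B at 4 bits lands under 16GB, and 65B under 32GB, consistent with the figures above.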
https://twitter.com/nJoyneer/status/1637863946383155220
I was able to run llama.cpp in the browser with a minimal patchset and some *FLAGS.
Here is the Emscripten version used:
$ emcc --version
emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.33-git
Copyright (C) 2014 the Emscripten authors (see AUTHORS.txt)
This is free and open source software under the MIT license.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Here are the compile flags:
make CC=emcc CXX=em++ LLAMA_NO_ACCELERATE=1 CFLAGS=" -DNDEBUG -s MEMORY64" CXXFLAGS=" -DNDEBUG -s MEMORY64" LDFLAGS="-s MEMORY64 -s TOTAL_MEMORY=8589934592 -s STACK_SIZE=2097152 --preload-file models " main.html
Here is the minimal patch:
diff --git a/Makefile b/Makefile
index 1601079..12a1a80 100644
--- a/Makefile
+++ b/Makefile
@@ -189,12 +189,16 @@ utils.o: utils.cpp utils.h
$(CXX) $(CXXFLAGS) -c utils.cpp -o utils.o
clean:
- rm -f *.o main quantize
+ rm -f *.o main.{html,wasm,js,data,worker.js} main quantize
main: main.cpp ggml.o utils.o
$(CXX) $(CXXFLAGS) main.cpp ggml.o utils.o -o main $(LDFLAGS)
./main -h
+main.html: main.cpp ggml.o utils.o
+ $(CXX) $(CXXFLAGS) main.cpp ggml.o utils.o -o main.html $(LDFLAGS)
+ go run server.go
+
quantize: quantize.cpp ggml.o utils.o
$(CXX) $(CXXFLAGS) quantize.cpp ggml.o utils.o -o quantize $(LDFLAGS)
diff --git a/ggml.c b/ggml.c
index 4813f74..3dc2cbc 100644
--- a/ggml.c
+++ b/ggml.c
@@ -6,6 +6,8 @@
#include <alloca.h>
#endif
+#define _POSIX_C_SOURCE 200809L
+
#include <assert.h>
#include <time.h>
#include <math.h>
@@ -107,7 +109,7 @@ typedef void* thread_ret_t;
do { \
if (!(x)) { \
fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \
- abort(); \
+ /*abort();*/ \
} \
} while (0)
diff --git a/main.cpp b/main.cpp
index e181056..afb0c53 100644
--- a/main.cpp
+++ b/main.cpp
@@ -785,7 +785,7 @@ int main(int argc, char ** argv) {
const int64_t t_main_start_us = ggml_time_us();
gpt_params params;
- params.model = "models/llama-7B/ggml-model.bin";
+ params.model = "models/7B/ggml-model-q4_0.bin";
if (gpt_params_parse(argc, argv, params) == false) {
return 1;
Wow, well done! Why did you have to remove abort(); from ggml?
The abort(); case was hit when out of memory, before the partial LLM output could be emitted, so no string was shown.
Unfortunately, I did not find any way of using the Memory64 WASM extension without running into bugs on both Firefox and Chrome.
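For context, the patch above simply comments out abort() inside GGML_ASSERT. A less drastic variant (a sketch of my own, not part of the patch) would keep the assert fatal on native builds and only log under Emscripten, so whatever output is already buffered still gets printed:

#include <stdio.h>
#include <stdlib.h>

#ifdef __EMSCRIPTEN__
// On WASM builds, log the failed assertion but keep running so the
// partial generation that is already buffered can still be shown.
#define GGML_ASSERT(x) \
    do { \
        if (!(x)) { \
            fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \
        } \
    } while (0)
#else
// Native builds keep the original fail-fast behaviour.
#define GGML_ASSERT(x) \
    do { \
        if (!(x)) { \
            fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \
            abort(); \
        } \
    } while (0)
#endif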
So given the WASM64 limits, you have to go for 3- and 4-bit quantization using GPTQ, I think.
It's already quantized to 4 bits when converting. 7B overflows the 8GB of allocated WASM64 memory, though.
Besides that, it's quite slow, since you cannot have both (pthread OR simd) AND memory64. Whenever I try to mix any of the two, the compiler or the linker fails to create or run the output.
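A small compile-time guard (my own sketch; it assumes clang defines __wasm64__ when targeting MEMORY64 and that Emscripten defines __EMSCRIPTEN_PTHREADS__ when -pthread is active) could at least turn that incompatible combination into a clear error instead of an obscure build failure:

// Fail fast if someone tries to combine memory64 with threads,
// which at the time of this thread did not build or run reliably.
#if defined(__EMSCRIPTEN__) && defined(__wasm64__) && defined(__EMSCRIPTEN_PTHREADS__)
#error "MEMORY64 together with -pthread is not supported by this build; drop one of the two"
#endif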
@thypon apparently memory64 is available in Firefox Nightly, did you check it?
The new RedPajama-3B seems like a nice tiny model that could probably fit without memory64.
@thypon @loretoparisi I'm curious, what sort of performance drop did you notice running in the browser compared to running natively? How many tokens/sec were you getting?
@IsaacRe I did not make a performance comparison, since it was not 100% stable and needed to be refined. As mentioned, it was single-core: multithreading + memory64 on Firefox Nightly did not work properly together and crashed the experiment.
@okpatil4u It is already running with experimental memory64.
Hey @thypon, did you make any progress on this experiment?
I'm not actively working on this at the current stage.
@okpatil4u I broadly followed the very useful steps above by @loretoparisi and was able to run only really small models with the latest Emscripten and a fairly recent master commit. I am way out of my comfort zone with C++ or WASM (I spend most of my time with TypeScript and Python). I didn't get around to installing Firefox Nightly and stopped for now. I last had the tiny Shakespeare models running in the browser. The diff by loretoparisi was made a while ago, so I had to make some significant changes, and I am a complete C++ noob, so take this with a bag of salt, but if it helps someone, great: https://github.com/lukestanley/llama.cpp/commit/41cbd2b8aff97c7297b3dbddbad6dfbb0f164380
I've tried the approach suggested by @lukestanley and @loretoparisi and got starcoder.cpp to run in the browser. I published a demo project on this: https://github.com/rahuldshetty/starcoder.js
I tried the tiny_starcoder_py model, as its weights were small enough to fit without mem64, and looked at the performance/accuracy. It seems like the output of the model without mem64 is gibberish, while the mem64 version produces meaningful output. Not sure if 32-bit vs 64-bit memory addressing has anything to do with it.
How about WebGPU as a way around the memory limit? Probably better to run it off-CPU where possible anyhow?
(full disclosure: I have no idea what I'm talking about.)
The memory64 implementation status is listed as complete for Emscripten: https://github.com/WebAssembly/memory64/blob/main/proposals/memory64/Overview.md
This issue was closed because it has been inactive for 14 days since being marked as stale.
Not sure what the progress is here; apparently there are overlapping or related open issues.
There is this project that might be relevant: https://github.com/ngxson/wllama
@ggerganov Thanks for sharing that. I'm already using https://github.com/tangledgroup/llama-cpp-wasm as the basis of a big project.
So far llama-cpp-wasm has allowed me to run pretty much any .gguf that is less than 2GB in size in the browser (and that limitation seems to be related to the caching mechanism of that project, so I suspect the real limit would be 4GB).
People talk about bringing AI to the masses, but the best way to do that is with browser-based technology. My mom is never going to install Ollama and the like.
But would your mom be OK with her browser downloading gigabytes of weights upon page visit? To me this seems like the biggest obstacle for browser-based LLM inference.
She's already doing it :-)
Sneak preview (100% browser-based):
Agreed, the download is the biggest obstacle. The best example so far is MLC LLM's web version: https://webllm.mlc.ai/
You can see that it downloads about 4GB in shards, roughly 20 or so, for Llama-2 7B weights, 4-bit quantized. Of course this means you can wait from tens of seconds to a few minutes before inference starts. And this is not going to change soon unless 3- or 2-bit quantization works better and the accuracy is as good as 4-bit...
For example, if we take Llama-2 8B, we have 108 shards, and it took 114 seconds to complete the download on my fiber connection before being ready to infer. On a Mac M1 Pro I get:
prefill: 13.5248 tokens/sec, decoding: 7.9857 tokens/sec
Models with “-1k” suffix signify 1024 context length, lowering ~2-3GB VRAM requirement compared to their counterparts. Feel free to start trying with those.
Hugging Face has recently released a streaming option for GGUF, where you can already start inference even though the model is not fully loaded yet. At least, that's my understanding from a recent YouTube video by Yannic Kilcher.
For my project I'm trying to use a less-than-2GB quant of Phi-2 with 128K context. I think that model will be the best option for browser-based use for a while.
You may be thinking of a library that Hugging Face released that can read GGUF metadata without downloading the whole file. You wouldn't gain much from streaming the model for inference; generally the entire model is needed to generate every token.
@slaren Ah, thanks for clarifying that. It sounded a little too good to be true :-)
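For reference, the GGUF metadata mentioned above lives in a small fixed header at the start of the file, so reading a few bytes is enough to identify a model without touching the weights. A minimal sketch of my own, assuming the documented GGUF v2/v3 layout (4-byte magic "GGUF", then uint32 version, uint64 tensor count, uint64 metadata key/value count, little-endian):

#include <cstdint>
#include <cstdio>

int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }
    std::FILE* f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }

    char magic[4];
    uint32_t version = 0;
    uint64_t n_tensors = 0, n_kv = 0;

    // The first 24 bytes are enough to tell whether this is a GGUF file
    // and how much metadata follows; no need to read the tensor data.
    if (std::fread(magic, 1, 4, f) != 4 ||
        std::fread(&version, sizeof version, 1, f) != 1 ||
        std::fread(&n_tensors, sizeof n_tensors, 1, f) != 1 ||
        std::fread(&n_kv, sizeof n_kv, 1, f) != 1) {
        std::fprintf(stderr, "short read\n");
        std::fclose(f);
        return 1;
    }
    std::fclose(f);

    std::printf("magic: %.4s, version: %u, tensors: %llu, metadata kv pairs: %llu\n",
                magic, version,
                (unsigned long long) n_tensors, (unsigned long long) n_kv);
    return 0;
}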
Hello, I have tried adding minimal Emscripten support to the Makefile. It compiles OK with both em++ and emcc. At this stage the problem is that main.cpp and quantize.cpp do not expose a proper header file, so I cannot call main as a module, or add a function export to main using Emscripten's EMSCRIPTEN_KEEPALIVE, for example. In fact, a simple C++ header could be compiled as a Node module and then called and executed from Node scripts.
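A minimal sketch of the kind of export being described (my own illustration; llama_run_prompt and its wiring into llama.cpp are hypothetical, only the EMSCRIPTEN_KEEPALIVE mechanism itself comes from Emscripten):

#include <emscripten/emscripten.h>

// Exporting a plain C symbol avoids C++ name mangling, and
// EMSCRIPTEN_KEEPALIVE prevents the optimizer from stripping it, so it
// stays callable from JavaScript (e.g. via Module.ccall / cwrap).
extern "C" EMSCRIPTEN_KEEPALIVE
const char* llama_run_prompt(const char* prompt) {
    // Hypothetical: forward to whatever entry point main.cpp would expose
    // once its logic is split out of main() into a reusable function.
    (void) prompt;
    return "not implemented";
}

// Built e.g. with: em++ ... -s MODULARIZE=1 -s EXPORTED_RUNTIME_METHODS=ccall
// and then called from a Node script, e.g.:
//   Module.ccall('llama_run_prompt', 'string', ['string'], ['Hello']);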