ggerganov / llama.cpp

LLM inference in C/C++
MIT License

WebAssembly and emscripten headers #97

Closed: loretoparisi closed this issue 6 months ago

loretoparisi commented 1 year ago

Hello, I have tried adding minimal Emscripten support to the Makefile:

# WASM
EMCXX = em++
EMCC = emcc
EMCXXFLAGS = --bind --std=c++11 -s WASM=1 -s ALLOW_MEMORY_GROWTH=1 -s "EXPORTED_RUNTIME_METHODS=['addOnPostRun','FS']" -s "DISABLE_EXCEPTION_CATCHING=0" -s "EXCEPTION_DEBUG=1" -s "FORCE_FILESYSTEM=1" -s "MODULARIZE=1" -s "EXPORT_ES6=0" -s 'EXPORT_NAME="LLAMAModule"' -s "USE_ES6_IMPORT_META=0" -I./
EMCCFLAGS = --bind -s WASM=1 -s ALLOW_MEMORY_GROWTH=1 -s "EXPORTED_RUNTIME_METHODS=['addOnPostRun','FS']" -s "DISABLE_EXCEPTION_CATCHING=0" -s "EXCEPTION_DEBUG=1" -s "FORCE_FILESYSTEM=1" -s "MODULARIZE=1" -s "EXPORT_ES6=0" -s 'EXPORT_NAME="LLAMAModule"' -s "USE_ES6_IMPORT_META=0" -I./ 

EMOBJS = utils.bc ggml.bc

wasm: llama_wasm.js quantize_wasm.js
wasmdebug: export EMCC_DEBUG=1
wasmdebug: llama_wasm.js quantize_wasm.js

#
# WASM lib
#

ggml.bc: ggml.c ggml.h
    $(EMCC) -c $(EMCCFLAGS) ggml.c -o ggml.bc
utils.bc: utils.cpp utils.h
    $(EMCXX) -c $(EMCXXFLAGS) utils.cpp -o utils.bc

$(info I EMOBJS:      $(EMOBJS))

#
# WASM executable
#
llama_wasm.js: $(EMOBJS) main.cpp Makefile
    $(EMCXX) $(EMCXXFLAGS) $(EMOBJS) main.cpp -o llama_wasm.js
quantize_wasm.js: $(EMOBJS) quantize.cpp Makefile
    $(EMCXX) $(EMCXXFLAGS) $(EMOBJS) quantize.cpp -o quantize_wasm.js

It compiles OK with both em++ and emcc. At this stage the problem is that main.cpp and quantize.cpp do not expose a proper header file, so I cannot call main as a module, or export a function to JavaScript with Emscripten's EMSCRIPTEN_KEEPALIVE (applied to main, for example).

In fact, with a simple C++ header the code could be compiled as a Node module and then called like

/** file:llama.js */
const llamaModularized = require('./llama_wasm.js');
var llamaModule = null
const _initLLAMAModule = async function () {
    llamaModule = await llamaModularized();
    return true
}
let postRunFunc = null;
let moduleReady = false;
const addOnPostRun = function (func) {
    postRunFunc = func;
    // if the module finished loading before the callback was registered, fire it now
    if (moduleReady) postRunFunc();
};
_initLLAMAModule().then(() => {
    moduleReady = true;
    if (postRunFunc) {
        postRunFunc();
    }
});

class LLaMa {
    constructor() {
        this.f = new llamaModule.LLaMa();
    }
    // here modules fun impl
}

module.exports = { LLaMa, addOnPostRun };

and then executed in node scripts like

/** file:run.js */
(async () => {
    const LLaMa = require('./llama.js');
    const loadWASM = function () {
        return new Promise(function (resolve, reject) {
            LLaMa.addOnPostRun(() => {
                let model = new LLaMa.LLaMa();
                /** use model functions */
                resolve(model); // resolve so the awaiting caller continues
            });
        });
    }//loadWASM
    await loadWASM();

}).call(this);
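
For completeness, a hedged sketch of the other route mentioned above: if main.cpp exposed a C entry point marked EMSCRIPTEN_KEEPALIVE (the function name llama_generate below is hypothetical), it could be driven through Emscripten's cwrap, provided 'ccall' and 'cwrap' are added to EXPORTED_RUNTIME_METHODS in the flags above:

// Sketch only: assumes main.cpp exposes a C entry point such as
//   extern "C" EMSCRIPTEN_KEEPALIVE int llama_generate(const char *prompt);
// (the name llama_generate is hypothetical) and that 'ccall'/'cwrap'
// are added to EXPORTED_RUNTIME_METHODS.
const llamaModularized = require('./llama_wasm.js');

(async () => {
    const Module = await llamaModularized();
    // cwrap builds a JS wrapper: (name, return type, argument types)
    const generate = Module.cwrap('llama_generate', 'number', ['string']);
    const rc = generate('Building a website can be done in 10 simple steps:');
    console.log('llama_generate returned', rc);
})();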
MarkSchmidty commented 1 year ago

Without https://github.com/WebAssembly/memory64 implemented in WebAssembly you are going to run into show-stopping memory issues with the current 4 GB limit due to 32-bit addressing.

Do you have a plan for getting around that?

Dicklesworthstone commented 1 year ago

If you quantized the 7B model to a mixture of 3-bit and 4-bit quantization using https://github.com/qwopqwop200/GPTQ-for-LLaMa then you could stay within that memory envelope.

MarkSchmidty commented 1 year ago

I think that's a reasonable proposal @Dicklesworthstone.

A purely 3-bit implementation of llama.cpp using GPTQ could retain acceptable performance and solve the same memory issues. There's an open issue for implementing GPTQ quantization in 3-bit and 4-bit: GPTQ Quantization (3-bit and 4-bit) #9.

Other use cases could benefit from this same enhancement, such as getting 65B under 32GB and 30B under 16GB to further extend access to (perhaps slightly weaker versions of) the larger models.
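
As a rough back-of-the-envelope check (a sketch only; real GGML/GPTQ files add quantization scales and group metadata, and inference needs extra buffers such as the KV cache):

// Rough weight-only size estimate: params * bits-per-weight / 8, ignoring
// quantization scales/zero-points and runtime buffers such as the KV cache.
const GiB = 1024 ** 3;
const sizeGiB = (params, bits) => (params * bits / 8) / GiB;

for (const [name, params] of [['7B', 7e9], ['13B', 13e9], ['30B', 30e9], ['65B', 65e9]]) {
    console.log(name,
        '4-bit ~', sizeGiB(params, 4).toFixed(1), 'GiB,',
        '3-bit ~', sizeGiB(params, 3).toFixed(1), 'GiB');
}
// 7B:  4-bit ~ 3.3 GiB, 3-bit ~ 2.4 GiB  -> plausibly under the 4 GiB wasm32 ceiling
// 65B: 4-bit ~ 30.3 GiB, 3-bit ~ 22.7 GiB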

thypon commented 1 year ago

https://twitter.com/nJoyneer/status/1637863946383155220

I was able to run llama.cpp in the browser with a minimal patchset and some *FLAGS

[Screenshot 2023-03-20 at 17 46 01: llama.cpp running in the browser]

The Emscripten version used:

$ emcc --version
emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.33-git
Copyright (C) 2014 the Emscripten authors (see AUTHORS.txt)
This is free and open source software under the MIT license.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The compile flags:

make CC=emcc CXX=em++ LLAMA_NO_ACCELERATE=1 CFLAGS=" -DNDEBUG -s MEMORY64" CXXFLAGS=" -DNDEBUG -s MEMORY64" LDFLAGS="-s MEMORY64 -s TOTAL_MEMORY=8589934592 -s STACK_SIZE=2097152 --preload-file models " main.html

The minimal patch:

diff --git a/Makefile b/Makefile
index 1601079..12a1a80 100644
--- a/Makefile
+++ b/Makefile
@@ -189,12 +189,16 @@ utils.o: utils.cpp utils.h
    $(CXX) $(CXXFLAGS) -c utils.cpp -o utils.o

 clean:
-   rm -f *.o main quantize
+   rm -f *.o main.{html,wasm,js,data,worker.js} main quantize

 main: main.cpp ggml.o utils.o
    $(CXX) $(CXXFLAGS) main.cpp ggml.o utils.o -o main $(LDFLAGS)
    ./main -h

+main.html: main.cpp ggml.o utils.o
+   $(CXX) $(CXXFLAGS) main.cpp ggml.o utils.o -o main.html $(LDFLAGS)
+   go run server.go
+
 quantize: quantize.cpp ggml.o utils.o
    $(CXX) $(CXXFLAGS) quantize.cpp ggml.o utils.o -o quantize $(LDFLAGS)

diff --git a/ggml.c b/ggml.c
index 4813f74..3dc2cbc 100644
--- a/ggml.c
+++ b/ggml.c
@@ -6,6 +6,8 @@
 #include <alloca.h>
 #endif

+#define _POSIX_C_SOURCE 200809L
+
 #include <assert.h>
 #include <time.h>
 #include <math.h>
@@ -107,7 +109,7 @@ typedef void* thread_ret_t;
     do { \
         if (!(x)) { \
             fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \
-            abort(); \
+            /*abort();*/ \
         } \
     } while (0)

diff --git a/main.cpp b/main.cpp
index e181056..afb0c53 100644
--- a/main.cpp
+++ b/main.cpp
@@ -785,7 +785,7 @@ int main(int argc, char ** argv) {
     const int64_t t_main_start_us = ggml_time_us();

     gpt_params params;
-    params.model = "models/llama-7B/ggml-model.bin";
+    params.model = "models/7B/ggml-model-q4_0.bin";

     if (gpt_params_parse(argc, argv, params) == false) {
         return 1;
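
The server.go referenced in the patch is not included here; a minimal equivalent static file server (a Node sketch, not the author's server.go) just needs to serve main.html, main.js, main.wasm and the preloaded main.data with sensible MIME types:

// Minimal static server sketch for the Emscripten output (not the author's server.go).
const http = require('http');
const fs = require('fs');
const path = require('path');

const MIME = {
    '.html': 'text/html',
    '.js': 'text/javascript',
    '.wasm': 'application/wasm', // correct MIME type lets the browser stream-compile
    '.data': 'application/octet-stream',
};

http.createServer((req, res) => {
    const file = path.join(__dirname, req.url === '/' ? 'main.html' : req.url);
    fs.readFile(file, (err, buf) => {
        if (err) { res.writeHead(404); return res.end('not found'); }
        res.writeHead(200, { 'Content-Type': MIME[path.extname(file)] || 'application/octet-stream' });
        res.end(buf);
    });
}).listen(8080, () => console.log('serving on http://localhost:8080'));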
loretoparisi commented 1 year ago

I was able to run llama.cpp in the browser with a minimal patchset and some *FLAGS [...]

Wow, well done! Why did you have to remove abort(); from ggml?

thypon commented 1 year ago

The abort(); case was hit when running out of memory, before the partial LLM output was printed, so no string was shown. Unfortunately, I did not find any way of using the Memory64 WASM extension without running into bugs on both Firefox and Chrome.

loretoparisi commented 1 year ago

The abort(); case was hit when running out of memory, before the partial LLM output was printed, so no string was shown. Unfortunately, I did not find any way of using the Memory64 WASM extension without running into bugs on both Firefox and Chrome.

So given the WASM64 limits, you have to go for 3- and 4-bit quantization using GPTQ, I think.

thypon commented 1 year ago

The abort(); case was hit when running out of memory, before the partial LLM output was printed, so no string was shown. Unfortunately, I did not find any way of using the Memory64 WASM extension without running into bugs on both Firefox and Chrome.

So given the WASM64 limits, you have to go for 3- and 4-bit quantization using GPTQ, I think.

It's already quantized to 4 bits when converting. 7B overflows 8GB of allocated WASM64 memory, though. Besides that, it's quite slow, since you cannot combine memory64 with either pthreads or SIMD. Whenever I try to mix any two of them, the compiler or the linker fails to create or run the output.

okpatil4u commented 1 year ago

@thypon apparently memory64 is available in Firefox Nightly, did you check it?

https://webassembly.org/roadmap/#feature-note-2

lapo-luchini commented 1 year ago

The new RedPajama-3B seems like a nice tiny model that could probably fit without memory64.

IsaacRe commented 1 year ago

@thypon @loretoparisi I'm curious, what sort of performance drop did you notice running in the browser compared to running natively? How many tokens/sec were you getting?

thypon commented 1 year ago

@IsaacRe I did not make a performance comparison, since it was not 100% stable and needed to be refined. As mentioned, it was single-core: multithreading + memory64 on Firefox Nightly did not work properly together and crashed the experiment.

@okpatil4u it was already running with experimental memory64.

okpatil4u commented 1 year ago

Hey @thypon, did you make any progress on this experiment?

thypon commented 1 year ago

I'm not actively working on this at the current stage.

lukestanley commented 1 year ago

@okpatil4u I broadly followed the very useful steps above by @loretoparisi and was able to run only really small models with the latest Emscripten and a fairly recent master commit. I am way out of my comfort zone with C++ and WASM (I spend most of my time with TypeScript and Python). I didn't get around to installing Firefox Nightly and have stopped for now. I last had the tiny Shakespeare models running in the browser. The diff by loretoparisi was made a while ago, so I had to make some significant changes, and I am a complete C++ noob, so take this with a bag of salt, but if it helps someone, great: https://github.com/lukestanley/llama.cpp/commit/41cbd2b8aff97c7297b3dbddbad6dfbb0f164380

rahuldshetty commented 1 year ago

I've tried the approach suggested by @lukestanley and @loretoparisi and got starcoder.cpp to run in the browser. Published a demo project on this: https://github.com/rahuldshetty/starcoder.js

I tried the tiny_starcoder_py model, as the weights were small enough to fit without mem64, and looked at the performance/accuracy. It seems like the output of the model without mem64 is gibberish, while the mem64 version produces meaningful output. Not sure if 32-bit vs 64-bit memory addressing has to do with it.

mindplay-dk commented 1 year ago

Without https://github.com/WebAssembly/memory64 implemented in WebAssembly you are going to run into show-stopping memory issues with the current 4 GB limit due to 32-bit addressing.

Do you have a plan for getting around that?

How about WebGPU? Probably better to run it off-CPU where possible anyhow?

(full disclosure: I have no idea what I'm talking about.)

mohamedmansour commented 1 year ago

Without https://github.com/WebAssembly/memory64 implemented in WebAssembly you are going to run into show-stopping memory issues with the current 4 GB limit due to 32-bit addressing.

Do you have a plan for getting around that?

The implementation status is complete for emscripten: https://github.com/WebAssembly/memory64/blob/main/proposals/memory64/Overview.md


github-actions[bot] commented 6 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

loretoparisi commented 6 months ago

Not sure what the progress is here; apparently there are overlapping or related open issues.

ggerganov commented 6 months ago

There is this project that might be relevant: https://github.com/ngxson/wllama

flatsiedatsie commented 6 months ago

@ggerganov Thanks for sharing that. I'm already using https://github.com/tangledgroup/llama-cpp-wasm as the basis of a big project.

So far llama-cpp-wasm has allowed me to run pretty much any .gguf that is less than 2GB in size in the browser (and that limitation seems to be related to the caching mechanism of that project, so I suspect the real limit would be 4GB).

People talk about bringing AI to the masses, but the best way to do that is with browser-based technology. My mom is never going to install Ollama and the like.

ggerganov commented 6 months ago

My mom is never going to install Ollama and the like.

But would your mom be OK with her browser downloading gigabytes of weights upon page visit? To me this seems like the biggest obstacle for browser-based LLM inference
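
One mitigation is to check, before starting a multi-gigabyte download, how much storage the browser will actually grant and whether it will persist. A hedged sketch using the standard StorageManager API (quotas vary widely between browsers):

// Check how much the browser is willing to store before pulling gigabytes of weights.
async function checkStorageBudget(requiredBytes) {
    if (!navigator.storage || !navigator.storage.estimate) return false;
    const { usage, quota } = await navigator.storage.estimate();
    // Ask the browser to keep the data across eviction passes (some browsers prompt the user).
    const persistent = navigator.storage.persist ? await navigator.storage.persist() : false;
    console.log(`used ${usage} of ${quota} bytes, persistent: ${persistent}`);
    return quota - usage >= requiredBytes;
}

// e.g. roughly 4 GB of 4-bit 7B weights
checkStorageBudget(4 * 1024 ** 3).then(ok => console.log('enough space:', ok));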

flatsiedatsie commented 6 months ago

She's already doing it :-)

Sneak preview:

[screenshot: sneak preview]

(100% browser based)

loretoparisi commented 6 months ago

My mom is never going to install Ollama and the like.

But would your mom be OK with her browser downloading gigabytes of weights upon page visit? To me this seems like the biggest obstacle for browser-based LLM inference

Agreed, the best example so far is MLC LLM, web version: https://webllm.mlc.ai/

You can see that it downloads about 4GB of weights in shards (20 shards or so) for Llama-2 7B, 4-bit quantized. Of course this means you can wait from tens of seconds to a few minutes before inference starts. And this is not going to change soon, unless 3- and 2-bit quantization works better and the accuracy is as good as 4-bit...

For example, if we take Llama-2 8B, there are 108 shards:

[Screenshot 2024-04-24 at 18 29 53: shard list]

and it took 114 seconds to complete on my fiber connection:

[Screenshot 2024-04-24 at 18 31 48: download time]

before being ready to infer:

[Screenshot 2024-04-24 at 18 32 41: ready to infer]

on a Mac M1 Pro I get

prefill: 13.5248 tokens/sec, decoding: 7.9857 tokens/sec

(Note from the WebLLM page: "Models with '-1k' suffix signify 1024 context length, lowering ~2-3GB VRAM requirement compared to their counterparts. Feel free to start trying with those.")
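
The repeat-visit cost can at least be avoided by caching shards locally. A hedged sketch using the standard Cache API (requires a secure context; assumes stable shard URLs, and is not WebLLM's actual caching code):

// Fetch a model shard, serving it from the Cache API when already downloaded.
async function fetchShard(url) {
    const cache = await caches.open('model-shards-v1');
    const hit = await cache.match(url);
    if (hit) return hit.arrayBuffer();          // repeat visit: no network traffic
    const resp = await fetch(url);
    if (!resp.ok) throw new Error(`failed to fetch ${url}: ${resp.status}`);
    await cache.put(url, resp.clone());         // store for next time
    return resp.arrayBuffer();
}

// Usage: download (or re-use) all shards of a model.
// const shards = await Promise.all(shardUrls.map(fetchShard));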
flatsiedatsie commented 6 months ago

Hugging Face has recently released a streaming option for GGUF, where you can already start inference even though the model is not fully loaded yet. At least, that's my understanding from a recent YouTube video by Yannic Kilcher.

For my project I'm trying to use a sub-2GB quant of Phi 2 with 128K context. I think that model will be the best model for browser-based use for a while.

slaren commented 6 months ago

You may be thinking of a library that Hugging Face released that can read GGUF metadata without downloading the whole file. You wouldn't gain much from streaming the model for inference; generally the entire model is needed to generate every token.
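
For reference, reading just the GGUF header from a remote file only takes an HTTP Range request. A sketch based on the published GGUF layout (magic, version, tensor count, metadata KV count); this is not Hugging Face's library, and it assumes the server honors Range requests:

// Read only the first 24 bytes of a remote GGUF file: magic, version,
// tensor count and metadata key/value count (all little-endian).
async function ggufHeader(url) {
    const resp = await fetch(url, { headers: { Range: 'bytes=0-23' } });
    if (!resp.ok) throw new Error(`fetch failed: ${resp.status}`);
    const view = new DataView(await resp.arrayBuffer());
    const magic = String.fromCharCode(view.getUint8(0), view.getUint8(1),
                                      view.getUint8(2), view.getUint8(3));
    if (magic !== 'GGUF') throw new Error('not a GGUF file');
    return {
        version: view.getUint32(4, true),
        tensorCount: view.getBigUint64(8, true),
        metadataKvCount: view.getBigUint64(16, true),
    };
}

// ggufHeader('https://example.com/model.gguf').then(console.log); // URL is a placeholder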

flatsiedatsie commented 6 months ago

@slaren Ah, thanks for clarifying that. It sounded a little too good to be true :-)