Atome-FE / llama-node

Believe in AI democratization. llama for Node.js, backed by llama-rs, llama.cpp and rwkv.cpp; works locally on your laptop CPU. Supports llama/alpaca/gpt4all/vicuna/rwkv models.
https://llama-node.vercel.app/
Apache License 2.0

llama-node/llama-cpp uses more memory than standalone llama.cpp with the same parameters #85

Open · fardjad opened this issue 1 year ago

fardjad commented 1 year ago

I'm trying to process a large text file. For the sake of reproducibility, let's use this. The following code:

```javascript
import { LLM } from "llama-node";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";
import path from "node:path";
import fs from "node:fs";

const model = path.resolve(process.cwd(), "/path/to/model.bin");
const llama = new LLM(LLamaCpp);
const prompt = fs.readFileSync("./path/to/file.txt", "utf-8");

await llama.load({
  enableLogging: true,
  modelPath: model,
  nCtx: 4096,
  nParts: -1,
  seed: 0,
  f16Kv: false,
  logitsAll: false,
  vocabOnly: false,
  useMlock: false,
  embedding: false,
  useMmap: false,
  nGpuLayers: 0,
});

await llama.createCompletion(
  {
    nThreads: 8,
    nTokPredict: 256,
    topK: 40,
    prompt,
  },
  (response) => {
    process.stdout.write(response.token);
  }
);
```

Crashes the process with a segfault error:

```
ggml_new_tensor_impl: not enough space in the scratch memory
segmentation fault  node index.mjs
```

When I compile the exact same version of llama.cpp and run it with the following args:

```
./main -m /path/to/ggml-vic7b-q5_1.bin -t 8 -c 4096 -n 256 -f ./big-input.txt
```

It runs perfectly fine (of course with a warning that the context is larger than what the model supports but it doesn't crash with a segfault).

Comparing the logs:

llama-node logs:

```
llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4936280.75 KB
llama_model_load_internal: mem required = 6612.59 MB (+ 2052.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 4096.00 MB
[Sun, 28 May 2023 14:35:50 +0000 - INFO - llama_node_cpp::context] - AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
[Sun, 28 May 2023 14:35:50 +0000 - INFO - llama_node_cpp::llama] - tokenized_stop_prompt: None
ggml_new_tensor_impl: not enough space in the scratch memory
```

llama.cpp logs:

```
main: warning: model does not support context sizes greater than 2048 tokens (4096 specified); expect poor results
main: build = 561 (5ea4339)
main: seed = 1685284790
llama.cpp: loading model from ../my-llmatic/models/ggml-vic7b-q5_1.bin
llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 72.75 KB
llama_model_load_internal: mem required = 6612.59 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size = 2048.00 MB
system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 4096, n_batch = 512, n_predict = 256, n_keep = 0
```

Looks like the ggml context size in llama-node is almost 5 GB (vs. 72.75 KB in llama.cpp), and the kv self size is twice as large as what llama.cpp uses (4096.00 MB vs. 2048.00 MB).
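
As a rough sanity check (my own back-of-the-envelope sketch, assuming the KV cache holds one K and one V vector per layer per context position for the 7B model described in the logs), an f32 cache works out to exactly twice the f16 one, which would line up with `f16Kv: false` in my load config:

```javascript
// Back-of-the-envelope KV cache size for the model in the logs above
// (n_ctx = 4096, n_layer = 32, n_embd = 4096), assuming the cache holds
// one K and one V vector per layer per context position.
const nCtx = 4096;
const nLayer = 32;
const nEmbd = 4096;

const elements = 2 * nCtx * nLayer * nEmbd; // K and V
const f16MB = (elements * 2) / 1024 / 1024; // 2 bytes per f16 element
const f32MB = (elements * 4) / 1024 / 1024; // 4 bytes per f32 element

console.log(f16MB, "MB"); // 2048 MB -> matches the llama.cpp "kv self size"
console.log(f32MB, "MB"); // 4096 MB -> matches the llama-node "kv self size"
```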

I'm not sure if I'm missing something in my Load/Invocation config or if that's an issue in llama-node. Can you please have a look?

hlhr202 commented 1 year ago

sure, will look into this soon.

hlhr202 commented 1 year ago

I guess it is caused by useMmap? llama.cpp enables mmap by default, but in your llama-node example you have useMmap set to false, so the model file isn't reused from the OS file cache. That is probably why you run out of memory, I think? Could you try enabling it, something like the sketch below?
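
(Just a sketch, reusing the load config from your example with only useMmap flipped to true; everything else unchanged.)

```javascript
await llama.load({
  enableLogging: true,
  modelPath: model,
  nCtx: 4096,
  nParts: -1,
  seed: 0,
  f16Kv: false,
  logitsAll: false,
  vocabOnly: false,
  useMlock: false,
  embedding: false,
  // Let the OS map the model file instead of copying the weights into
  // process memory, so pages can be shared with / reused from the file cache.
  useMmap: true,
  nGpuLayers: 0,
});
```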

fardjad commented 1 year ago

I'm afraid that is not the case. Before you updated the version of llama.cpp, I couldn't run my example (with or without setting useMmap). Now it doesn't crash, but it doesn't seem to be doing anything either.

I recorded a video comparing llama-node and llama.cpp:

https://github.com/Atome-FE/llama-node/assets/817642/b86209b0-b0da-402c-95fa-622a617e3686

As you can see, llama-node sort of freezes with the larger input, whereas llama.cpp starts emitting tokens after ~30 secs.