Atome-FE / llama-node

Believe in AI democratization. llama for Node.js, backed by llama-rs, llama.cpp and rwkv.cpp; runs locally on your laptop CPU. Supports llama/alpaca/gpt4all/vicuna/rwkv models.
https://llama-node.vercel.app/
Apache License 2.0

Llama2 quantized q5_1 #108

Open HolmesDomain opened 1 year ago

HolmesDomain commented 1 year ago

I am getting this error:

llama.cpp: loading model from /Documents/Proj/delta/llama-2-7b-chat/ggml-model-q5_1.bin
error loading model: unrecognized tensor type 14

llama_init_from_file: failed to load model
node:internal/process/promises:289
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[Error: Failed to initialize LLama context from file: /Documents/Proj/delta/llama-2-7b-chat/ggml-model-q5_1.bin] {
  code: 'GenericFailure'
}

My index.js:

import { LLM } from "llama-node";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";
import path from "path";

const model = path.resolve(process.cwd(), "./llama-2-7b-chat/ggml-model-q5_1.bin");
const llama = new LLM(LLamaCpp);
const config = {
    modelPath: model,
    enableLogging: false,
    nCtx: 1024,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: false,
    useMmap: true,
    nGpuLayers: 0
};

const run = async () => {
    await llama.load(config);

    await llama.createCompletion({
        prompt: "My favorite movie is",
        nThreads: 4,
        nTokPredict: 1024,
        topK: 40,
        topP: 0.1,
        temp: 0.3,
        repeatPenalty: 1,
    }, (response) => {
        process.stdout.write(response.token);
    });
};

run();

It worked before I quantized, but inference is very slow right now, and I was hoping quantization would speed it up.
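
For what it's worth, here is a minimal sketch of wrapping the load in a try/catch so the failure shows up as a readable message instead of the uncaught promise rejection above. It reuses only the config options and llama-node calls already in my index.js; the only change is enabling logging so llama.cpp prints its own loading diagnostics too.

import { LLM } from "llama-node";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";
import path from "path";

const model = path.resolve(process.cwd(), "./llama-2-7b-chat/ggml-model-q5_1.bin");
const llama = new LLM(LLamaCpp);

const run = async () => {
    try {
        // Same options as the config above, with logging enabled.
        await llama.load({
            modelPath: model,
            enableLogging: true,
            nCtx: 1024,
            seed: 0,
            f16Kv: false,
            logitsAll: false,
            vocabOnly: false,
            useMlock: false,
            embedding: false,
            useMmap: true,
            nGpuLayers: 0
        });
    } catch (err) {
        // Catching the rejected promise avoids the triggerUncaughtException crash
        // shown in the log above and prints the offending model path instead.
        console.error(`Failed to load model at ${model}:`, err);
        process.exit(1);
    }
};

run();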

HolmesDomain commented 1 year ago

Got it running by using the .bin file from here: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/tree/main

I had no luck generating the q5_1 myself by following the instructions here: https://github.com/ggerganov/llama.cpp#prepare-data--run

If this is a common problem, maybe you could point people toward downloading the pre-quantized files directly from TheBloke.
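
For anyone following along, a minimal sketch of the only change I made to the index.js above: point modelPath at the downloaded file instead of the locally converted one. The filename below is illustrative (it follows the ggmlv3 q5_1 naming in TheBloke's repo); adjust it to whichever file you actually download.

import path from "path";

// Pre-quantized file downloaded from TheBloke instead of a local conversion.
// Filename is an example; use the variant you downloaded.
const model = path.resolve(process.cwd(), "./llama-2-7b-chat.ggmlv3.q5_1.bin");

// Everything except modelPath stays the same as before.
const config = {
    modelPath: model,
    enableLogging: false,
    nCtx: 1024,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: false,
    useMmap: true,
    nGpuLayers: 0
};

The rest of the script (llama.load(config) and createCompletion) is unchanged from my earlier index.js.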