Atome-FE / llama-node

Believe in AI democratization. llama for Node.js, backed by llama-rs, llama.cpp and rwkv.cpp; works locally on your laptop CPU. Supports llama/alpaca/gpt4all/vicuna/rwkv models.
https://llama-node.vercel.app/
Apache License 2.0

Interactive #47

Open luca-saggese opened 1 year ago

luca-saggese commented 1 year ago

I'm new to LLMs and llama but learning fast. I've written a small piece of code to chat via the CLI, but it does not seem to follow the context (i.e. work in interactive mode).

import path from "path";
import readline from "readline";
import { LLM } from "llama-node";
import { LLamaRS } from "llama-node/dist/llm/llama-rs.js";

const saveSession = path.resolve(process.cwd(), "./tmp/session.bin");
const loadSession = path.resolve(process.cwd(), "./tmp/session.bin");
const model = path.resolve(process.cwd(), "./ggml-vic7b-q4_1.bin");

const llama = new LLM(LLamaRS);
llama.load({ path: model });

const rl = readline.createInterface(process.stdin, process.stdout);
console.log("Chatbot started!");
rl.setPrompt("> ");
rl.prompt();
rl.on("line", async function (line) {
    const prompt = `A chat between a user and an assistant.
    USER: ${line}
    ASSISTANT:`;
    llama.createCompletion({
        prompt,
        numPredict: 128,
        temp: 0.2,
        topP: 1,
        topK: 40,
        repeatPenalty: 1,
        repeatLastN: 64,
        seed: 0,
        feedPrompt: true,
        saveSession,
        loadSession,
    }, (response) => {
        if(response.completed) {
            process.stdout.write('\n'); 
            rl.prompt(); 
        } else {
            process.stdout.write(response.token);
        }  
    });
});

Am I missing something?

hlhr202 commented 1 year ago

@luca-saggese you need to maintain the context on the Node.js side, i.e. you should keep a list of chat history messages and make sure the combined list does not exceed the context length of your model. That's why llama-node also exposes the tokenizer to Node.js.
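
For anyone skimming this later, here is a minimal sketch of that idea in the same style as the code above. The history array, CHAR_BUDGET, and buildPrompt names are made up for illustration, and the character budget is just a crude stand-in for counting tokens with the tokenizer llama-node exposes:

// Keep the chat history in memory and rebuild the full prompt on every turn.
const history = []; // items look like { role: "USER" | "ASSISTANT", text: "..." }

// Crude character budget; ideally count tokens with the exposed tokenizer and
// compare against the model's real context length.
const CHAR_BUDGET = 6000;

const buildPrompt = (userLine) => {
    history.push({ role: "USER", text: userLine });

    // Drop the oldest messages until the transcript fits the budget.
    while (history.reduce((n, m) => n + m.text.length, 0) > CHAR_BUDGET && history.length > 2) {
        history.shift();
    }

    const transcript = history.map((m) => `${m.role}: ${m.text}`).join("\n");
    return `A chat between a user and an assistant.\n${transcript}\nASSISTANT:`;
};

// After a completion finishes, push the assistant's reply back as well, e.g.
// history.push({ role: "ASSISTANT", text: generatedText });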

luca-saggese commented 1 year ago

@hlhr202 thanks for the comment. Where should I pass the context to the new query? Within the prompt?

hlhr202 commented 1 year ago

Yes, your prompt should be a string composed from the chat history list. At the same time, you also have to make sure it doesn't exceed the context length limit of the model.

luca-saggese commented 1 year ago

Understood. And what is the point of saveSession and loadSession?

hlhr202 commented 1 year ago

https://github.com/Atome-FE/llama-node/issues/24

They are used for accelerating loading.

end-me-please commented 1 year ago

@luca-saggese I had great success using saveSession/loadSession for chatbots. (Thanks for implementing it hlhr202 <3 it made everything so much easier.)

Keeping a list of previous messages in every prompt (as he suggested) works, but it is slow.

Instead, during startup, I call createCompletion with the initial prompt, feedPromptOnly, and saveSession once. (You can also copy the initial cache file to make future startups faster.)

Every new message is added individually with feedPromptOnly and saveSession+loadSession.

To get a bot response, just call createCompletion without feedPromptOnly, as usual.

This is still limited by the context length, with the added disadvantage that you can't clear old messages (though it takes a while to run into the 2048-token context limit).

It also seems to improve "conversation memory" without the extra cost of including more messages in the chat history.
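
For illustration, a rough sketch of that flow, reusing the llama, saveSession and loadSession variables from the code above; the option names are the ones that appear in this thread, but treat this as an untested outline rather than a verified recipe:

// 1) At startup: evaluate the system prompt once and cache it to disk.
await llama.createCompletion({
    prompt: "A chat between a user and an assistant.\n",
    feedPrompt: true,
    feedPromptOnly: true, // feed the prompt into the session cache, no generation
    saveSession,
}, () => {});

// 2) For each incoming user message: append it to the cached session.
const feedUserMessage = (line) =>
    llama.createCompletion({
        prompt: `USER: ${line}\n`,
        feedPrompt: true,
        feedPromptOnly: true,
        loadSession,
        saveSession,
    }, () => {});

// 3) To get the bot's reply: run a normal completion on top of the session.
const generateReply = (onToken) =>
    llama.createCompletion({
        prompt: "ASSISTANT:",
        numPredict: 256,
        temp: 0.2,
        feedPrompt: true,
        loadSession,
        saveSession,
    }, (response) => {
        if (!response.completed) onToken(response.token);
    });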

end-me-please commented 1 year ago

Regarding the context length limit: https://github.com/rustformers/llm/issues/77 might be related.

luca-saggese commented 1 year ago

@end-me-please thanks for the help. Here is a working version for anyone interested:

import { LLM } from "llama-node";
import readline from "readline";
import fs from "fs";
import { LLamaRS } from "llama-node/dist/llm/llama-rs.js";
import path from "path";

const sessionFile = path.resolve(process.cwd(), "./tmp/session.bin");
const saveSession = sessionFile;
const loadSession = sessionFile;
// remove old session
if(fs.existsSync(sessionFile)) fs.unlinkSync(sessionFile);

const model = path.resolve(process.cwd(), "./ggml-vic7b-q4_1.bin"); // or: ggml-vicuna-7b-1.1-q4_1.bin

const llama = new LLM(LLamaRS);
llama.load({ path: model });

const rl = readline.createInterface(process.stdin, process.stdout);
console.log("Chatbot started!");
rl.setPrompt("> ");
rl.prompt();
let cnt = 0;
rl.on("line", async function (line) {
    // Build this turn's prompt; the system preamble is only added on the first turn,
    // since the session cache already holds the earlier turns.
    const prompt = `USER: ${line}
                    ASSISTANT:`;
    llama.createCompletion({
        prompt: cnt == 0 ? 'A chat between a user and an assistant.\n\n' + prompt : prompt,
        numPredict: 1024,
        temp: 0.2,
        topP: 1,
        topK: 40,
        repeatPenalty: 1,
        repeatLastN: 64,
        seed: 0,
        feedPrompt: true,
        saveSession,
        loadSession,
    }, (response) => {
        if(response.completed) {
            process.stdout.write('\n'); 
            rl.prompt(); 
            cnt ++;
        } else {
            process.stdout.write(response.token);
        }  
    });
});

ralyodio commented 1 year ago

Can we make it so previous prompts are kept in an array? Otherwise the entire history would be shown again with every response.

linonetwo commented 1 year ago

There is an interactive GUI in https://talk.tiddlywiki.org/t/tidgi-is-the-first-bi-link-note-taking-app-with-free-local-ai-that-works-totally-offline-and-privately/7600

CodeJjang commented 1 year ago

@end-me-please @luca-saggese I can't make it work. I am calling:

llama.load(config).then(() => {
    return llama.createCompletion({
        nThreads: 4,
        nTokPredict: 2048,
        topK: 40,
        topP: 0.1,
        temp: 0.8,
        repeatPenalty: 1,
        prompt: instructions,
        feedPrompt: true,
        feedPromptOnly: true,
        saveSession,
        loadSession
    }, (resp) => { console.log(resp); });
}).then(() => console.log('Finished init llm'));

Two weird things:

  1. No session file is created.
  2. Why is the callback ("console.log(resp)") being called if feedPromptOnly is true (i.e. it shouldn't do inference)?

And then:

    const resp = await llama.createCompletion({
      nThreads: 4,
      nTokPredict: 2048,
      topK: 40,
      topP: 0.1,
      temp: 0.8,
      repeatPenalty: 1,
      prompt,
      loadSession
    }, (cbResp) => {process.stdout.write(cbResp.token);})

The first prompt that I fed is completely ignored...