Atome-FE / llama-node

Believe in AI democratization. llama for Node.js, backed by llama-rs, llama.cpp and rwkv.cpp; works locally on your laptop CPU. Supports llama/alpaca/gpt4all/vicuna/rwkv models.
https://llama-node.vercel.app/
Apache License 2.0

Code only using 4 CPUs when I have 16 CPUs #69

Open · gaurav-cointab opened this issue 1 year ago

gaurav-cointab commented 1 year ago

This is the code that I am using:

```js
import { RetrievalQAChain } from 'langchain/chains';
import { HNSWLib } from 'langchain/vectorstores';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { LLamaEmbeddings } from 'llama-node/dist/extensions/langchain.js';
import { LLM } from 'llama-node';
import { LLamaCpp } from 'llama-node/dist/llm/llama-cpp.js';
import * as fs from 'fs';
import * as path from 'path';

const txtFilename = 'TrainData';
const txtPath = `./${txtFilename}.txt`;
const VECTOR_STORE_PATH = `${txtFilename}.index`;
const model = path.resolve(process.cwd(), './h2ogptq-oasst1-512-30B.ggml.q5_1.bin');

const llama = new LLM(LLamaCpp);
const config = {
    path: model,
    enableLogging: true,
    nCtx: 1024,
    nParts: -1,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: true,
    useMmap: true,
};

let vectorStore;

const run = async () => {
    await llama.load(config);

    if (fs.existsSync(VECTOR_STORE_PATH)) {
        console.log('Vector Exists..');
        vectorStore = await HNSWLib.fromExistingIndex(VECTOR_STORE_PATH, new LLamaEmbeddings({ maxConcurrency: 1 }, llama));
    } else {
        console.log('Creating Documents');
        const text = fs.readFileSync(txtPath, 'utf8');
        const textSplitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000 });
        const docs = await textSplitter.createDocuments([text]);
        console.log('Creating Vector');
        vectorStore = await HNSWLib.fromDocuments(docs, new LLamaEmbeddings({ maxConcurrency: 1 }, llama));
        await vectorStore.save(VECTOR_STORE_PATH);
    }

    console.log('Testing Vector via Similarity Search');
    const resultOne = await vectorStore.similaritySearch('what is a template', 1);
    console.log(resultOne);

    console.log('Testing Vector via RetrievalQAChain');
    const chain = RetrievalQAChain.fromLLM(llama, vectorStore.asRetriever());
    const res = await chain.call({
        query: 'what is a template',
    });
    console.log({ res });
};

run();
```

It only uses 4 CPUs during the call to `vectorStore = await HNSWLib.fromDocuments(docs, new LLamaEmbeddings({maxConcurrency: 1}, llama));`

Can we change anything so that it uses more than 4 CPUs?
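
For context, llama.cpp at the time capped its default thread count at 4 unless a thread count was passed in, which may explain the 4-CPU ceiling. Below is a minimal, unverified sketch of raising the thread count on a direct embedding call; it assumes the LLamaCpp invocation params accept an `nThreads` field (as in llama-node's completion examples) and does not confirm whether the LangChain `LLamaEmbeddings` wrapper forwards that option.

```js
// Unverified sketch: ask llama.cpp for more threads on a direct embedding call.
// Assumes the LLamaCpp invocation params accept nThreads; whether the
// LangChain LLamaEmbeddings wrapper exposes this option is not confirmed here.
const embedding = await llama.getEmbedding({
    prompt: "what is a template",
    nThreads: 16, // hypothetical: use all 16 cores instead of the default 4
});
console.log(embedding);
```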

hlhr202 commented 1 year ago

Not yet. llama.cpp does not seem to support parallel inference at the moment. I may look for another way (such as implementing a round robin at the Rust level) to handle this.
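
For illustration only, the round-robin idea could also be sketched at the JS level by loading several independent model instances and rotating requests across them. This is a hypothetical sketch, not the planned Rust-level implementation, and it assumes you have enough RAM to keep multiple copies of the model loaded.

```js
// Hypothetical JS-level round robin across several llama-node instances.
// Not the planned Rust-level implementation; assumes enough RAM to load
// POOL_SIZE copies of the model, each with its own llama.cpp context.
import { LLM } from "llama-node";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";

const POOL_SIZE = 4; // tune to available memory and cores
const pool = [];
let next = 0;

const initPool = async (config) => {
    for (let i = 0; i < POOL_SIZE; i++) {
        const llama = new LLM(LLamaCpp);
        await llama.load(config);
        pool.push(llama);
    }
};

// Each call goes to the next instance in the pool, so independent
// requests can run on different instances concurrently.
const getEmbeddingRoundRobin = (params) => {
    const instance = pool[next];
    next = (next + 1) % pool.length;
    return instance.getEmbedding(params);
};
```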

pavelpiha commented 1 year ago

@hlhr202 any updates on that?

HolmesDomain commented 1 year ago

@hlhr202 u gotta hire us when you make it big :) LGTM