katopz opened this issue 6 months ago
Can you perhaps try this?
This one took 6.88 s, which seems faster. 🤔
Just make sure that the embedding model you used to generate the vector collection / snapshot is the same as the one rag-api-server starts with.
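If it's easier to check from the outside, the server should also report the model names it was started with. I'm assuming here that rag-api-server exposes the usual OpenAI-style `/v1/models` endpoint on the default port, like the other LlamaEdge servers:

```bash
# Assumption: rag-api-server serves the OpenAI-compatible /v1/models endpoint.
# The returned names should include the embedding model the server was started
# with (all-MiniLM-L6-v2-ggml-model-f16 in your case).
curl http://127.0.0.1:8080/v1/models
```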
I'm not quite sure which line I have to check. I followed the steps from the README, which are:
wasmedge --dir .:. --nn-preload default:GGML:AUTO:Llama-2-7b-chat-hf-Q5_K_M.gguf \
--nn-preload embedding:GGML:AUTO:all-MiniLM-L6-v2-ggml-model-f16.gguf \
rag-api-server.wasm \
--model-name Llama-2-7b-chat-hf-Q5_K_M,all-MiniLM-L6-v2-ggml-model-f16 \
--ctx-size 4096,384 \
--prompt-template llama-2-chat \
--rag-prompt "Use the following pieces of context to answer the user's question.\nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n" \
--log-prompts \
--log-stat
and
curl -X POST http://127.0.0.1:8080/v1/create/rag -F "file=@paris.txt"
They should be the same, right?
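For reference, a query against the running server would then look roughly like this (assuming the standard OpenAI-style chat endpoint and the chat model name passed via `--model-name` above):

```bash
# Rough example of querying the RAG-enabled server once the collection exists.
# Endpoint and payload follow the OpenAI chat-completions format; the model
# name is the chat model the server was started with.
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "Llama-2-7b-chat-hf-Q5_K_M",
        "messages": [
          {"role": "user", "content": "What is the capital of France?"}
        ]
      }'
```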
You started the rag-api-server with all-MiniLM-L6-v2-ggml-model-f16.gguf,
so the command you used to create the embeddings should also use all-MiniLM-L6-v2-ggml-model-f16.gguf.
If you just ran the steps in the docs, you should be fine.
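If you want to rule out a mismatch explicitly, one rough way (just a sketch; the `EMBED_MODEL` variable is only for illustration, and the logging flags are omitted) is to pin the filename once and reuse it in both places:

```bash
# Illustrative only: keep the embedding model filename in one variable so the
# server start-up and the embedding step cannot drift apart.
EMBED_MODEL=all-MiniLM-L6-v2-ggml-model-f16.gguf

wasmedge --dir .:. \
  --nn-preload default:GGML:AUTO:Llama-2-7b-chat-hf-Q5_K_M.gguf \
  --nn-preload embedding:GGML:AUTO:"$EMBED_MODEL" \
  rag-api-server.wasm \
  --model-name Llama-2-7b-chat-hf-Q5_K_M,"${EMBED_MODEL%.gguf}" \
  --ctx-size 4096,384 \
  --prompt-template llama-2-chat

# The embeddings are created through that same running server, so they are
# necessarily produced by the same embedding model it was started with.
curl -X POST http://127.0.0.1:8080/v1/create/rag -F "file=@paris.txt"
```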
Yes, I just ran the steps in the docs exactly (many times by now), but it's still slow.
I think I'm missing something pretty obvious 🤔.
After trying the steps from the README:
It took 590824.84 ms (nearly 10 minutes) just to chunk a 306-line (91 KB) file on an M3 Max.
Is this just me, or am I missing some flag?
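A rough way to narrow down where the time goes (just a sketch using standard shell tools; `paris-small.txt` is a hypothetical truncated copy of the file):

```bash
# Time the full 91 KB file...
time curl -X POST http://127.0.0.1:8080/v1/create/rag -F "file=@paris.txt"

# ...then a much smaller slice of it, to see whether the cost scales with the
# number of chunks (pointing at embedding throughput) or is mostly fixed
# overhead (pointing at something in the setup).
head -n 30 paris.txt > paris-small.txt
time curl -X POST http://127.0.0.1:8080/v1/create/rag -F "file=@paris-small.txt"
```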