Running the app takes some time to load the model into memory, and since we're using a quantized version, `llm.to('cuda')` is not used.
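For reference, this is roughly how a 4-bit quantized model is loaded with transformers and bitsandbytes (the model id is a placeholder, not necessarily the one used here). `device_map="auto"` is what makes an explicit `.to('cuda')` unnecessary, since the quantized weights are placed on the GPU during loading:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # placeholder; substitute the actual model

# 4-bit quantization config; compute happens in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" places the quantized weights on the GPU at load time,
# so a subsequent llm.to('cuda') is unnecessary (and not supported for
# bitsandbytes-quantized models).
llm = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```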
The answers from the RAG are pretty decent, provided the prompt is well structured. Below is a screenshot of the two-shot learning 👍🏽
If the context is not provided, the model does reply that it does not have the information to answer the question.
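A minimal sketch of how such a prompt could be structured; the exact wording, the two worked examples, and the `build_prompt` helper are assumptions for illustration, not the actual prompt used here. The key parts are the two-shot examples and the explicit instruction to refuse when the context lacks the answer:

```python
PROMPT_TEMPLATE = """Answer the question using only the provided context.
If the context does not contain the answer, reply:
"I do not have the information to answer this question."

Example 1:
Context: The mitochondrion is the powerhouse of the cell.
Question: What is the powerhouse of the cell?
Answer: The mitochondrion.

Example 2:
Context: Water boils at 100 degrees Celsius at sea level.
Question: What is the capital of France?
Answer: I do not have the information to answer this question.

Context: {context}
Question: {question}
Answer:"""

def build_prompt(context: str, question: str) -> str:
    # Fill the two-shot template with the retrieved context and user question.
    return PROMPT_TEMPLATE.format(context=context, question=question)
```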
I haven't evaluated it against any benchmarks yet.
GPU requirements still need to be evaluated, since the app is slow while loading the embedding model and generating the document embeddings. However, one of the PDFs is a textbook, so I wonder if that is the bottleneck.
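One thing worth checking is whether the document embeddings are generated one chunk at a time; batching on the GPU is usually much faster. A sketch assuming sentence-transformers, with the model name and chunk list as placeholder assumptions:

```python
from sentence_transformers import SentenceTransformer

# Assumed embedding model; substitute whichever model the app actually uses.
embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")

chunks = ["chunk 1 text...", "chunk 2 text..."]  # hypothetical document chunks

# Encoding in batches keeps the GPU busy instead of paying per-call overhead
# for each chunk; show_progress_bar helps spot where the time actually goes.
embeddings = embedder.encode(
    chunks,
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True,
)
print(embeddings.shape)  # (num_chunks, embedding_dim)
```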
Future tasks:
A chat app using Gradio (a minimal sketch follows this list).
Enhancements to RAG, e.g. embedding only keyphrases extracted with KeyBERT, and checking whether the response is good or bad (experimental; sketched below).
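For the chat app, a minimal Gradio sketch; `rag_answer` is a hypothetical stand-in for the pipeline, not an existing function in this repo:

```python
import gradio as gr

def rag_answer(question: str) -> str:
    # Hypothetical stand-in for the RAG pipeline (retrieve context, build
    # the prompt, generate with the quantized LLM).
    return f"Answer for: {question}"

def respond(message, history):
    # gr.ChatInterface passes the new message and the chat history;
    # here only the message is forwarded to the pipeline.
    return rag_answer(message)

gr.ChatInterface(respond).launch()
```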
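And a rough sketch of the KeyBERT idea, extracting keyphrases that could be embedded instead of full chunk text; the parameters and sample chunk are illustrative assumptions:

```python
from keybert import KeyBERT

kw_model = KeyBERT()  # uses a default sentence-transformers model under the hood

chunk = "Retrieval-augmented generation combines a retriever with a language model..."

# Extract the top keyphrases from a chunk; embedding these instead of the
# full chunk text might reduce noise in retrieval (experimental idea).
keyphrases = kw_model.extract_keywords(
    chunk,
    keyphrase_ngram_range=(1, 3),
    stop_words="english",
    top_n=5,
)
print(keyphrases)  # list of (phrase, relevance score) tuples
```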