amajji / LLM-RAG-Chatbot-With-LangChain

Development and deployment of a question-answering LLM chatbot using Llama 2 with 7B parameters and RAG with LangChain

LLM RAG Chatbot (CPU-only)

Data scientist | Anass MAJJI


:monocle_face: Description

Before demoing the Streamlit web application, let's walk through the details of the RAG approach to understand how it works. The retriever acts like an internal search engine: given a user query, it returns a few relevant elements from the external data sources. The main steps of a RAG system are:

- **Indexing**: split the external documents into chunks, embed each chunk, and store the embeddings in a vector database.
- **Retrieval**: embed the user query and fetch the chunks most similar to it from the vector database.
- **Generation**: pass the retrieved chunks to the LLM as context alongside the query, so the answer is grounded in the external data.
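To make the retrieval step concrete, here is a minimal sketch using LangChain with a FAISS vector store and HuggingFace sentence embeddings. The file name, chunk sizes, embedding model, and `k` are illustrative assumptions, not necessarily what this repo uses (and depending on your LangChain version, these imports may live under `langchain_community`):

```python
# Minimal sketch of the retrieval step (illustrative; not the repo's exact config).
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# 1. Split the external documents into chunks ("my_document.txt" is hypothetical).
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(open("my_document.txt").read())

# 2. Embed each chunk and index it in a vector database (FAISS here).
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.from_texts(chunks, embeddings)

# 3. At query time, embed the question and return the k most similar chunks.
retriever = db.as_retriever(search_kwargs={"k": 3})
docs = retriever.get_relevant_documents("What does the document say about X?")
for doc in docs:
    print(doc.page_content)
```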

To reach good accuracy with LLMs, we need to understand and choose each hyperparameter carefully. Before diving into the details, let's recall the LLM's decoding process. LLMs rely on the transformer architecture, originally composed of two main blocks: an encoder, which converts the input tokens into embeddings (i.e., numerical vectors), and a decoder, which generates tokens from those embeddings (the opposite of the encoder); most modern LLMs, including Llama 2, keep only the decoder. There are two main types of decoding: greedy and sampling. With greedy decoding, the model simply chooses the token with the highest probability at each step during inference.

With sampling decoding, in contrast, the model selects a subset of candidate output tokens and randomly picks one of them to append to the output text. This creates more variability and helps the LLM be more creative. However, sampling also increases the risk of incorrect responses.

When opting for sampling decoding, we have two additional hyperparameters that impact the model's behavior: Top_k, which restricts sampling to the k most probable tokens, and Top_p (nucleus sampling), which restricts it to the smallest set of tokens whose cumulative probability reaches p.
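To make the difference concrete, here is a small self-contained sketch of greedy, Top_k, and Top_p selection over a toy next-token distribution (plain NumPy; the vocabulary and probabilities are made up for illustration):

```python
import numpy as np

# Toy next-token distribution over a 6-token vocabulary (illustrative values).
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

# Greedy decoding: always pick the single most probable token.
greedy_token = vocab[int(np.argmax(probs))]

def top_k_sample(probs, k, rng):
    # Keep only the k most probable tokens, renormalize, then sample.
    top = np.argsort(probs)[::-1][:k]
    p = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=p))

def top_p_sample(probs, p_threshold, rng):
    # Keep the smallest set of tokens whose cumulative probability reaches
    # p_threshold (the "nucleus"), renormalize, then sample.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    nucleus = order[: int(np.searchsorted(cum, p_threshold)) + 1]
    p = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=p))

rng = np.random.default_rng(0)
print("greedy:", greedy_token)                              # deterministic
print("top-k :", vocab[top_k_sample(probs, k=3, rng=rng)])  # one of the 3 best
print("top-p :", vocab[top_p_sample(probs, 0.8, rng=rng)])  # smallest 80% nucleus
```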

Another parameter to take into consideration is the memory needed to run the LLM: for a model with N parameters at full precision (fp32), the weights alone need N × 4 bytes. When we quantize, the memory is divided by (4 bytes / new precision): with fp16 (2 bytes per parameter), it is divided by 4/2 = 2.
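As a worked example (weights only, ignoring KV-cache and activation overhead), here is the arithmetic for a 7B-parameter model such as Llama 2 7B at several precisions:

```python
n_params = 7e9  # Llama 2 7B

bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}
for precision, nbytes in bytes_per_param.items():
    # Memory for the weights only; runtime overhead comes on top.
    print(f"{precision}: {n_params * nbytes / 1e9:.1f} GB")
# fp32: 28.0 GB, fp16: 14.0 GB, int8: 7.0 GB, int4: 3.5 GB
```

This is why 4-bit quantization is what makes running a 7B model feasible on a CPU-only machine with modest RAM.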

:rocket: Repository Structure

The repository contains the following files & directories:

:chart_with_upwards_trend: Demonstration

In this section, we demonstrate the Streamlit web app: the user can ask any question and the chatbot will answer.

To launch the deployment of the Streamlit app with Docker, run the following commands:
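For example, assuming a Dockerfile at the repo root (the image name `rag-chatbot` is a placeholder; adapt it to the repo's actual Dockerfile):

```bash
# Build the image from the repo's Dockerfile.
docker build -t rag-chatbot .

# Run the container, mapping Streamlit's default port 8501 to the host.
docker run -p 8501:8501 rag-chatbot
```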

Once the container is running, browse to http://localhost:8501 (or http://0.0.0.0:8501) to view the app.

:chart_with_upwards_trend: Performance & results


:mailbox_closed: Contact

For any information, feedback, or questions, please contact me.