OpenPecha / rag_prep_tool

MIT License

RAG0005: LLM Selection #8

Closed tenzin3 closed 2 weeks ago

tenzin3 commented 1 month ago

Description:

Selecting one of the following models for final response generation.

Criteria

Expected Output:

The final LLM best suited for our RAG application.

Implementation Steps

tenzin3 commented 1 month ago

Initially, two questions were generated per chunk, but the questions generated from the same chunk turned out to be similar, so only the first one was kept.

Experiment setup:

- book = "The Art of Happiness at Work" by the Dalai Lama
- embedding model = Alibaba-NLP/gte-large-en-v1.5
- question generation model = voidful/context-only-question-generator
- chunk_size = 500, chunk_overlap = 100
- number of chunks = 170
- number of questions generated = 170
- number of contexts given = 2 to 5 (top 2 up to top 5)
- embedding similarity checker for faithfulness and relevancy = cosine similarity
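For reference, a minimal sketch of this setup, assuming plain character-based chunking and assuming the question-generation model runs under the Transformers text2text-generation pipeline; the helper name and the book file path are illustrative, not taken from the repo:

```python
# Sketch of the chunking / question-generation / embedding setup described above.
# chunk_text and the file path are hypothetical; only the model names and the
# 500/100 chunk parameters come from this comment.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

def chunk_text(text, chunk_size=500, chunk_overlap=100):
    """Split text into overlapping character chunks (500 / 100 as above)."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# One question kept per chunk (the second generated question was usually similar).
question_gen = pipeline("text2text-generation",
                        model="voidful/context-only-question-generator")

# Embeddings used both for retrieval and for the cosine-similarity
# faithfulness / relevancy check.
embedder = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5",
                               trust_remote_code=True)

book_text = open("art_of_happiness_at_work.txt").read()   # hypothetical path
chunks = chunk_text(book_text)                             # ~170 chunks for this book
questions = [question_gen(c, max_new_tokens=64)[0]["generated_text"]
             for c in chunks]

chunk_emb = embedder.encode(chunks, convert_to_tensor=True)
question_emb = embedder.encode(questions, convert_to_tensor=True)
similarity = util.cos_sim(question_emb, chunk_emb)         # used to pick top-2 to top-5 contexts
```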

prompt = f"""

You are a spiritual leader. Your students will seek your guidance. Answer their questions based solely on the provided context. If the context does not contain the information needed to answer a question, respond with "I don't know." Context: {context} Question: {question}

"""

tenzin3 commented 1 month ago

Model arguments

- max number of new tokens = 500
- temperature = 0
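A minimal sketch of a single timed generation with these arguments (temperature = 0 corresponds to greedy decoding), using the Transformers API and the Phi-3 variant discussed below; the prompt placeholder stands in for the filled-in template from the previous comment:

```python
# Single timed inference with the arguments above (max_new_tokens=500, temperature=0).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

prompt = "..."  # the filled-in prompt (context + question) from the comment above

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
start = time.time()
output = model.generate(**inputs, max_new_tokens=500, do_sample=False)  # temperature 0 == greedy
response_time = time.time() - start  # reported in seconds in the tables below
answer = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:],
                          skip_special_tokens=True)
```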

Server Specifications

[Image: server specifications]

microsoft/Phi-3-mini-128k-instruct model

[Image: response-time results; response_time is in seconds]

microsoft/Phi-3-mini-4k-instruct-gguf

The microsoft/Phi-3-mini-4k-instruct-gguf model has not yet been integrated into the Hugging Face Transformers module. When I downloaded the model and ran it on a Vast.ai server, the average inference time was around 90 seconds per request (averaged over the first five questions), which is unsatisfactory.
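For completeness, one common way to run a GGUF checkpoint outside Transformers is llama-cpp-python; whether this matches the setup actually used on the Vast.ai server is an assumption, and the local file name below is hypothetical:

```python
# Running a GGUF model with llama-cpp-python (an assumption about the setup used).
from llama_cpp import Llama

llm = Llama(
    model_path="Phi-3-mini-4k-instruct-q4.gguf",  # hypothetical local file name
    n_ctx=4096,
)

prompt = "..."  # the filled-in prompt from the earlier comment
result = llm(prompt, max_tokens=500, temperature=0)  # same generation arguments as above
answer = result["choices"][0]["text"]
```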

google/gemma-7b

When testing the google/gemma-7b model with the simple context "Apple is green in colour." and the question "What is the colour of Apple?", the output was inaccurate and demonstrated hallucination. This suggests that the gemma-7b model is not suitable for instructional chatbot applications, as it struggles with generating precise and reliable responses.

[Image: google/gemma-7b output showing the hallucinated response]

meta-llama/Meta-Llama-3-8B-Instruct

[Image: meta-llama/Meta-Llama-3-8B-Instruct response times]

tenzin3 commented 1 month ago

Observations from Language model responses

Negative output

For question "How long does it take to rewire one's mind?". llama3 answer with top 4 context :> According to the context, it takes repeated effort and time to rewire one's mind. The passage states, ""it takes repeated effort. And it takes time. For this kind of analytical meditation to work, one needs deep and sustained reflection on these alternative ways of viewing one's situation."" It also mentions that it can be a long and difficult process to reshape one's attitudes and outlook, and that many people resist giving up their misery, which can make it a challenging and time-consuming process. llama3 answer with top 5 context:>I don't know. The text does not provide a specific answer to this question. It mentions that it takes repeated effort and time to reshape one's attitudes and outlook, but it does not provide a specific timeframe for rewiring one's mind.

With the top 5 contexts, the model responds in a negative sense, saying it does not know, even though the top-4-context answer shows the information is present.

Mention of context in the answer

For question: "Did the Dalai Lama have a shaven head?" llama3 answer with top2 context: > "I don't know. The context does not mention the Dalai Lama's hair or head, so I cannot provide an answer to this question."

The context is supplied through the LlamaIndex retrieval method, so the user does not need to know that any "context" exists. Alternatively, the model could respond that the answer is not present in the Dalai Lama's books.

Inconsistent answers

For question: "Is there a program in India where prisoners are being taught ingestionmeditation?" phi3 answer with top2 and 3 context: " Yes, there is a program in India where prisoners are being taught meditation." phi3 answer with top 4 context: "I don't know." phi3 answer with top5 context: " No, the context provided mentions a program in India where prisoners are being taught meditation, not ingestion meditation."

tenzin3 commented 1 month ago

Conclusion

[Image: comparison table of the models' results]

Based on the table above and the observations of their responses, microsoft/Phi-3-mini-128k-instruct seems like the better choice for the following reasons:

Better Speed and Larger Context

Currently, the experiment is run and timed for one question at a time, but a chatbot application will later need to carry context and information across previous turns of the conversation. A larger context window and better speed would therefore be a huge plus.

Superior Information Retrieval

Both language models were equipped with the same LlamaIndex retrieval tool, and the Phi-3 model was clearly better at finding answers.
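For context, a rough sketch of what that retrieval setup could look like with the current llama_index.core API; the issue does not show the exact code, so treat the details (including the trust_remote_code flag and chunk source) as assumptions:

```python
# Hypothetical LlamaIndex retrieval setup over the book chunks.
from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(
    model_name="Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True
)

chunks = ["..."]  # the ~170 book chunks from the earlier comment
index = VectorStoreIndex.from_documents([Document(text=c) for c in chunks])

retriever = index.as_retriever(similarity_top_k=4)   # top-2 to top-5 contexts were compared
nodes = retriever.retrieve("How many human beings are there?")
contexts = [n.get_content() for n in nodes]          # inserted into the prompt as {context}
```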

Example:

Question: "How many human beings are there?"

LLaMA-3 Answers:

Phi-3 Answers:

Explanation:

The Phi-3 model demonstrates better contextual understanding and retrieval of relevant information.

Analysis of LLaMA-3

LLaMA-3 tends to provide longer responses even when the answer is unknown, which may contribute to higher relevance and faithfulness scores due to the repetition of the question text.

Example Question: "Who met with President George as a statesman?"

Answer: "I don't know. There is no mention of President George or any meeting with him in the provided context."

LLaMA-3's tendency to repeat the question text leads to higher scores when it does not have a definitive answer from the context.
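A small illustration of this effect, assuming the cosine-similarity checker from the earlier comment is used as the relevancy score: an "I don't know" answer that echoes the question scores higher than a terse one simply because it shares the question's wording.

```python
# Why echoing the question inflates a cosine-similarity relevancy score.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)

question = "Who met with President George as a statesman?"
echoing_answer = ("I don't know. There is no mention of President George or any "
                  "meeting with him in the provided context.")
terse_answer = "I don't know."

q_emb, echo_emb, terse_emb = embedder.encode(
    [question, echoing_answer, terse_answer], convert_to_tensor=True
)
print(util.cos_sim(q_emb, echo_emb).item())   # higher: the answer repeats the question's terms
print(util.cos_sim(q_emb, terse_emb).item())  # lower: little lexical or semantic overlap
```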

Based on these factors, the microsoft/Phi-3-mini-128k-instruct model is recommended for its better speed, larger context handling, and superior information retrieval capabilities.

tenzin3 commented 3 weeks ago

Results for 80 ChatGPT-generated questions

Prompt

template = f"""
    You are a chatbot designed to answer questions using content from the Dalai Lama's books.

    Follow these guidelines:

    - Answer the question based on the given contexts (some of which might be irrelevant).
    - Be elaborate and precise.
    - Answer directly, without adding any extra words.
    - Be careful of the language, ensuring it is respectful and appropriate.
    - If you do not have a proper answer from the context, respond with "I dont have enough data to provide an answer."
    - Do not give a response longer than 3000 tokens.

    Contexts: {context}

    Question: {question}

    """

phi3-mini-128k

[Image: phi3-mini-128k results on the 80 ChatGPT-generated questions]