OpenPecha / rag_prep_tool

MIT License

RAG0005: LLM Selection #8

Closed tenzin3 closed 2 weeks ago

tenzin3 commented 1 month ago

Description:

Selecting one of the following models for final response generation.

Criteria

Expected Output:

The final LLM best suited for our RAG application.

Implementation Steps

tenzin3 commented 1 month ago

Initially, two questions were generated per chunk, but the questions generated from the same chunk turned out to be similar, so only the first one was kept.

Experiment setup:

- book = "The Art of Happiness at Work" by the Dalai Lama
- embedding model = Alibaba-NLP/gte-large-en-v1.5
- question generation model = voidful/context-only-question-generator
- chunk_size = 500, chunk_overlap = 100
- number of chunks = 170
- number of questions generated = 170
- number of contexts given = 2 to 5 (top 2 up to top 5)
- embedding similarity checker for faithfulness and relevancy = cosine similarity
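For reference, a minimal sketch of this setup, assuming plain character-based chunking and assuming the question-generation model runs under the Transformers text2text-generation pipeline; the helper name and the book file path are illustrative, not taken from the repo:

```python
# Sketch of the chunking / question-generation / embedding setup described above.
# chunk_text and the file path are hypothetical; only the model names and the
# 500/100 chunk parameters come from this comment.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

def chunk_text(text, chunk_size=500, chunk_overlap=100):
    """Split text into overlapping character chunks (500 / 100 as above)."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# One question kept per chunk (the second generated question was usually similar).
question_gen = pipeline("text2text-generation",
                        model="voidful/context-only-question-generator")

# Embeddings used both for retrieval and for the cosine-similarity
# faithfulness / relevancy check.
embedder = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5",
                               trust_remote_code=True)

book_text = open("art_of_happiness_at_work.txt").read()   # hypothetical path
chunks = chunk_text(book_text)                             # ~170 chunks for this book
questions = [question_gen(c, max_new_tokens=64)[0]["generated_text"]
             for c in chunks]

chunk_emb = embedder.encode(chunks, convert_to_tensor=True)
question_emb = embedder.encode(questions, convert_to_tensor=True)
similarity = util.cos_sim(question_emb, chunk_emb)         # used to pick top-2 to top-5 contexts
```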

prompt = f"""

You are a spiritual leader. Your students will seek your guidance. Answer their questions based solely on the provided context. If the context does not contain the information needed to answer a question, respond with "I don't know." Context: {context} Question: {question}

"""

tenzin3 commented 1 month ago

Model arguments

- max number of new tokens = 500
- temperature = 0
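A minimal sketch of a single timed generation with these arguments (temperature = 0 corresponds to greedy decoding), using the Transformers API and the Phi-3 variant discussed below; the prompt placeholder stands in for the filled-in template from the previous comment:

```python
# Single timed inference with the arguments above (max_new_tokens=500, temperature=0).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

prompt = "..."  # the filled-in prompt (context + question) from the comment above

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
start = time.time()
output = model.generate(**inputs, max_new_tokens=500, do_sample=False)  # temperature 0 == greedy
response_time = time.time() - start  # reported in seconds in the tables below
answer = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:],
                          skip_special_tokens=True)
```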

Server Specifications

[Image: server specifications]

microsoft/Phi-3-mini-128k-instruct model

[Image: response-time results; response_time is in seconds]

microsoft/Phi-3-mini-4k-instruct-gguf

The microsoft/Phi-3-mini-4k-instruct-gguf model has not yet been integrated into the Hugging Face Transformers module. When I downloaded the model and ran it on a Vast.ai server, the average inference time was around 90 seconds per request (averaged over the first five questions), which is unsatisfactory.
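For completeness, one common way to run a GGUF checkpoint outside Transformers is llama-cpp-python; whether this matches the setup actually used on the Vast.ai server is an assumption, and the local file name below is hypothetical:

```python
# Running a GGUF model with llama-cpp-python (an assumption about the setup used).
from llama_cpp import Llama

llm = Llama(
    model_path="Phi-3-mini-4k-instruct-q4.gguf",  # hypothetical local file name
    n_ctx=4096,
)

prompt = "..."  # the filled-in prompt from the earlier comment
result = llm(prompt, max_tokens=500, temperature=0)  # same generation arguments as above
answer = result["choices"][0]["text"]
```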

google/gemma-7b

When testing the google/gemma-7b model with the simple context "Apple is green in colour." and the question "What is the colour of Apple?", the output was inaccurate and demonstrated hallucination. This suggests that the gemma-7b model is not suitable for instructional chatbot applications, as it struggles with generating precise and reliable responses.

[Image: google/gemma-7b output showing the hallucinated response]

meta-llama/Meta-Llama-3-8B-Instruct

[Image: meta-llama/Meta-Llama-3-8B-Instruct response times]

tenzin3 commented 1 month ago

Observations from Language model responses

Negative output

For question "How long does it take to rewire one's mind?". llama3 answer with top 4 context :> According to the context, it takes repeated effort and time to rewire one's mind. The passage states, ""it takes repeated effort. And it takes time. For this kind of analytical meditation to work, one needs deep and sustained reflection on these alternative ways of viewing one's situation."" It also mentions that it can be a long and difficult process to reshape one's attitudes and outlook, and that many people resist giving up their misery, which can make it a challenging and time-consuming process. llama3 answer with top 5 context:>I don't know. The text does not provide a specific answer to this question. It mentions that it takes repeated effort and time to reshape one's attitudes and outlook, but it does not provide a specific timeframe for rewiring one's mind.

With the top 5 contexts, the model responds in a negative sense, saying it does not know, even though the top-4-context answer shows the information is present.

Mention of context in the answer

For question: "Did the Dalai Lama have a shaven head?" llama3 answer with top2 context: > "I don't know. The context does not mention the Dalai Lama's hair or head, so I cannot provide an answer to this question."

The context is supplied through the LlamaIndex retrieval method, so the user does not need to know that any "context" exists. Alternatively, the model could respond that the answer is not present in the Dalai Lama's books.

Inconsistent answers

For question: "Is there a program in India where prisoners are being taught ingestionmeditation?" phi3 answer with top2 and 3 context: " Yes, there is a program in India where prisoners are being taught meditation." phi3 answer with top 4 context: "I don't know." phi3 answer with top5 context: " No, the context provided mentions a program in India where prisoners are being taught meditation, not ingestion meditation."

tenzin3 commented 1 month ago

Conclusion

[Image: comparison table of the models' results]

Based on the table above and the observations of their responses, microsoft/Phi-3-mini-128k-instruct seems like the better choice for the following reasons:

Better Speed and Larger Context

Currently, the experiment is run and timed for one question at a time, but a chatbot application will later need to carry context and information across previous turns of the conversation. A larger context window and better speed would therefore be a huge plus.

Superior Information Retrieval

Both language models were equipped with the same LlamaIndex retrieval tool, and the Phi-3 model was clearly better at finding answers.
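For context, a rough sketch of what that retrieval setup could look like with the current llama_index.core API; the issue does not show the exact code, so treat the details (including the trust_remote_code flag and chunk source) as assumptions:

```python
# Hypothetical LlamaIndex retrieval setup over the book chunks.
from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(
    model_name="Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True
)

chunks = ["..."]  # the ~170 book chunks from the earlier comment
index = VectorStoreIndex.from_documents([Document(text=c) for c in chunks])

retriever = index.as_retriever(similarity_top_k=4)   # top-2 to top-5 contexts were compared
nodes = retriever.retrieve("How many human beings are there?")
contexts = [n.get_content() for n in nodes]          # inserted into the prompt as {context}
```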

Example:

Question: "How many human beings are there?"

LLaMA-3 Answers:

Phi-3 Answers:

Explanation:

The Phi-3 model demonstrates better contextual understanding and retrieval of relevant information.

Analysis of LLaMA-3

LLaMA-3 tends to provide longer responses even when the answer is unknown, which may contribute to higher relevance and faithfulness scores due to the repetition of the question text.

Example Question: "Who met with President George as a statesman?"

Answer: "I don't know. There is no mention of President George or any meeting with him in the provided context."

LLaMA-3's tendency to repeat the question text leads to higher scores when it does not have a definitive answer from the context.
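A small illustration of this effect, assuming the cosine-similarity checker from the earlier comment is used as the relevancy score: an "I don't know" answer that echoes the question scores higher than a terse one simply because it shares the question's wording.

```python
# Why echoing the question inflates a cosine-similarity relevancy score.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)

question = "Who met with President George as a statesman?"
echoing_answer = ("I don't know. There is no mention of President George or any "
                  "meeting with him in the provided context.")
terse_answer = "I don't know."

q_emb, echo_emb, terse_emb = embedder.encode(
    [question, echoing_answer, terse_answer], convert_to_tensor=True
)
print(util.cos_sim(q_emb, echo_emb).item())   # higher: the answer repeats the question's terms
print(util.cos_sim(q_emb, terse_emb).item())  # lower: little lexical or semantic overlap
```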

Based on these factors, the microsoft/Phi-3-mini-128k-instruct model is recommended for its better speed, larger context handling, and superior information retrieval capabilities.

tenzin3 commented 3 weeks ago

Results for 80 ChatGPT-generated questions

Prompt

template = f"""
    You are a chatbot designed to answer questions using content from the Dalai Lama's books.

    Follow these guidelines:

    - Answer the question based on the given contexts (some of which might be irrelevant).
    - Be elaborate and precise.
    - Answer directly, without adding any extra words.
    - Be careful of the language, ensuring it is respectful and appropriate.
    - If you do not have a proper answer from the context, respond with "I dont have enough data to provide an answer."
    - Do not give a response longer than 3000 tokens.

    Contexts: {context}

    Question: {question}

    """

phi3-mini-128k

[Image: phi3-mini-128k results on the 80 ChatGPT-generated questions]