explodinggradients / ragas

Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
https://docs.ragas.io
Apache License 2.0

why is answer_relevancy always 0? #1475

Open Z-oo883 opened 2 hours ago

Z-oo883 commented 2 hours ago

environment:

- python 3.9.20
- datasets 3.0.1
- langchain 0.3.3
- langchain-community 0.3.2
- langchain-core 0.3.10
- langchain-openai 0.2.2
- langchain-text-splitters 0.3.0
- ragas 0.0.22

code:

from langchain_community.chat_models import ChatOpenAI
from langchain.prompts.chat import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate, \
    MessagesPlaceholder
from operator import itemgetter
import warnings
import re
import os
import pandas as pd
from pandas import array
from langchain_community.vectorstores import FAISS, DistanceStrategy
from langchain.embeddings import HuggingFaceEmbeddings
from pydantic import BaseModel
from langchain_core.output_parsers import JsonOutputParser
import json
from langchain.output_parsers import StructuredOutputParser, ResponseSchema
from langchain_text_splitters import RecursiveCharacterTextSplitter
from datasets import Dataset
import openai
from dotenv import load_dotenv

load_dotenv('.env')

warnings.filterwarnings('ignore')

content_path = r"data/test.md"
with open(content_path, "r", encoding='utf-8') as f:
    page_content = f.read()

from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_document = page_content

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)

chat = ChatOpenAI(
    model="Qwen2",
    temperature=0.3,
    openai_api_key="xxxx",
    openai_api_base='xxxx',
    stop=['<|im_end|>']
)

# System prompt: "You are a dialogue assistant; answer the user's question based on the document content."
system_prompt = SystemMessagePromptTemplate.from_template('你是一个对话助手,基于文档内容回答用户问题')

# User prompt: "Answer the question based on the document content below: {context} Question: {query}"
user_prompt = HumanMessagePromptTemplate.from_template('''
基于下面的文档内容回答问题:

{context}

问题: {query}

''')

full_chat_prompt = ChatPromptTemplate.from_messages(
    [system_prompt, MessagesPlaceholder(variable_name="chat_history"), user_prompt]
)

'''
<|im_start|>system
你是一个对话助手.
<|im_end|>
...
<|im_start|>user
仅基于下面的文本回答问题:

{context}

Question: {query}
<|im_end|>
<|im_start|>assistant
......
<|im_end|>
'''

# init embedding model
# note: raw string so the backslash in the local path is not treated as an escape
embedding_model_name = r"embedding\bge-large-zh-v1.5"
embedding_model_kwargs = {'device': 'cpu'}
embedding_encode_kwargs = {'batch_size': 32, 'normalize_embeddings': True}

embed_model = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    model_kwargs=embedding_model_kwargs,
    encode_kwargs=embedding_encode_kwargs
)

vector_load = FAISS.load_local('test.faiss', embed_model, allow_dangerous_deserialization=True)
faiss_retriever = vector_load.as_retriever(search_type="similarity", search_kwargs={"k": 20})

chat_chain = {
    "context": itemgetter("query") | faiss_retriever,
    "query": itemgetter("query"),
    "chat_history": itemgetter("chat_history"),
} | full_chat_prompt | chat

chat_history = []

# Question: "What is algorithm orchestration?"
questions = [
    "算法编排是什么?"
]
# Ground truth: "Algorithm orchestration is the process of defining components on an algorithm
# canvas via drag-and-drop, implementing data flow and computation for different business scenarios."
ground_truths = [
    ["算法编排是可通过拖拉拽的方式在算法画布上定义组件,实现不同的业务场景下的数据流转和计算的过程。"]
]
answers = []
contexts = []
for question in questions:
    answers.append(chat_chain.invoke({'query': question, 'chat_history': chat_history}).content)
    contexts.append([splits.page_content for splits in faiss_retriever.get_relevant_documents(question)])

data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truths": ground_truths
}
print(data)

dataset = Dataset.from_dict(data)
print(dataset)

from langchain.chat_models.openai import ChatOpenAI
from ragas.llms.langchain import LangchainLLM
from ragas.metrics import context_recall, context_precision, faithfulness, answer_relevancy

rag_llm = LangchainLLM(llm=chat)

faithfulness.llm = rag_llm
faithfulness.embeddings = embed_model
context_recall.llm = rag_llm
context_recall.embeddings = embed_model
context_precision.llm = rag_llm
context_precision.embeddings = embed_model
answer_relevancy.llm = rag_llm
answer_relevancy.embeddings = embed_model

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    context_precision,
)

result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        # context_relevancy,
        # context_recall,
        # context_precision,
    ],
)

print(result)

pd.set_option("display.max_colwidth", None)

df = result.to_pandas()
print(df)
df.to_csv("data/test.csv", index=False, encoding='utf_8_sig')

outcome:

evaluating with [answer_relevancy]
100%|██████████| 1/1 [00:03<00:00,  3.20s/it]
{'answer_relevancy': 0.0000}
          question  ... answer_relevancy
0  算法编排是什么?  ...              0.0

[1 rows x 5 columns]

Process finished with exit code 0

Why is answer_relevancy always 0? What should I do? Please help me, thank you!

dosubot[bot] commented 2 hours ago

The answer_relevancy is always 0 because the score calculation includes a multiplication by int(not committal). If committal is True (meaning at least one answer is noncommittal), int(not committal) becomes 0, resulting in a score of 0.

Here's the relevant part of the code:

committal = np.any([answer.noncommittal for answer in answers])
score = cosine_sim.mean() * int(not committal)

If committal is True, the score will always be 0 due to the multiplication [1].
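To make that concrete, here is a tiny sketch of how the multiplication zeroes out the score; the cosine_sim values and noncommittal flags below are made-up illustrations, not ragas output:

```python
import numpy as np

# Hypothetical per-question cosine similarities and noncommittal flags,
# mimicking the snippet quoted above.
cosine_sim = np.array([0.82, 0.79, 0.85])      # similarities themselves look fine
noncommittal_flags = [True, False, False]      # one generated answer was judged noncommittal

committal = np.any(noncommittal_flags)         # True as soon as a single flag is set
score = cosine_sim.mean() * int(not committal) # 0.82 * 0 -> 0.0
print(score)                                   # 0.0
```

So a 0 here usually means the judge LLM flagged the answer as noncommittal (e.g. "I don't know"), not that the embedding similarity itself is 0.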

To continue talking to Dosu, mention @dosu.

Z-oo883 commented 2 hours ago

So what should I do? @Dosu

Z-oo883 commented 1 hour ago

> The answer_relevancy is always 0 because the score calculation includes a multiplication by int(not committal). If committal is True (meaning at least one answer is noncommittal), int(not committal) becomes 0, resulting in a score of 0.
>
> Here's the relevant part of the code:
>
> committal = np.any([answer.noncommittal for answer in answers])
> score = cosine_sim.mean() * int(not committal)
>
> If committal is True, the score will always be 0 due to the multiplication [1].
>
> To continue talking to Dosu, mention @dosu.

So what should I do?

dosubot[bot] commented 1 hour ago

To resolve the issue of answer_relevancy always being 0, follow these steps:

  1. Check Noncommittal Flag: Ensure that the noncommittal flag is set correctly. If all answers are marked as noncommittal, the score will be zero due to the multiplication by int(not committal).

  2. Verify Embeddings: Make sure the embeddings are correctly initialized and used. The calculate_similarity function relies on embeddings to compute cosine similarity. Incorrect embeddings will lead to a failed similarity calculation.

  3. Inspect Generated Questions: Ensure that the generated questions are not empty. If they are, the score will be set to np.nan.

  4. Review LLM Configuration: Confirm that the language model (llm) is properly configured and capable of generating relevant questions.

  5. Adjust Strictness: Consider adjusting the strictness parameter, which affects the number of questions generated per answer.

  6. Understand Relevancy Calculation: The answer_relevancy is calculated as the mean cosine similarity between the original user_input and generated questions based on the response. Ensure that the embeddings for both the generated questions and the original question are accurate [1][2].
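If it helps, here is a rough, out-of-band way to sanity-check steps 2 and 6: embed the original question and a few hand-written stand-ins for the questions the judge LLM should regenerate from the answer, then look at the raw cosine similarities. This is only a sketch, not ragas' internal code; it reuses embed_model and questions from the snippet above, and the generated list is made up for illustration.

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embed the original question with the same embedding model wired into the metric.
q_emb = embed_model.embed_query(questions[0])

# Hand-written stand-ins for the questions the judge LLM is supposed to
# regenerate from the answer; replace them with whatever your LLM actually returns.
generated = ["算法编排是什么?", "算法编排的作用是什么?"]
gen_embs = [embed_model.embed_query(g) for g in generated]

sims = [cosine(q_emb, e) for e in gen_embs]
print(sims, np.mean(sims))
```

If these similarities come out well above 0, the embeddings are working and the 0.0 score most likely comes from the noncommittal flag described in step 1.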

To continue talking to Dosu, mention @dosu.

Z-oo883 commented 1 hour ago
> 1. Check Noncommittal Flag

How do I check the noncommittal flag?