divrajput opened this issue 2 months ago (status: Open)
Hey there, @divrajput! I'm here to help you with the issue you've encountered. Let's work together to solve this bug and get things back on track!
To resolve the exception and the resulting NaN values in the DataFrame while evaluating a dataset with the ragas library, a LangChain LLM, and Sentence Transformer embeddings, you can set the `raise_exceptions` parameter to `False` in the `evaluate` function. The evaluation will then return `np.nan` for any row that fails instead of raising an exception.

Here is the relevant part of the code where you can set this parameter:
```python
result = evaluate(
    dataset=dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
    raise_exceptions=False,  # Set this to False to handle exceptions gracefully
    llm=langchain_llm,
    embeddings=langchain_embeddings,
)
```
By setting `raise_exceptions` to `False`, the evaluation will handle exceptions gracefully and return `np.nan` for any rows that encounter issues [1].

Additionally, ensure that your `RunConfig` is properly set up to handle retries and timeouts, which can help mitigate transient errors:
```python
from ragas.run_config import RunConfig

run_config = RunConfig(
    max_retries=10,
    max_wait=60,
    thread_timeout=300,
    log_tenacity=True,
)
```
```python
result = evaluate(
    dataset=dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
    raise_exceptions=False,
    llm=langchain_llm,
    embeddings=langchain_embeddings,
    run_config=run_config,  # Pass the run_config here
)
```
This configuration will help manage retries and timeouts more effectively [2][1].
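With `raise_exceptions=False`, failed rows surface as NaN instead of errors, so it helps to isolate them for inspection. A minimal sketch, assuming the `Result` object returned by `evaluate` supports `to_pandas()` as in recent ragas versions:

```python
# Convert the evaluation result to a DataFrame and isolate the failed rows
df = result.to_pandas()
failed = df[df["faithfulness"].isna() | df["answer_relevancy"].isna()]
print(f"{len(failed)} of {len(df)} rows failed; inspect their inputs:")
print(failed[["question", "answer"]])
```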
To continue talking to Dosu, mention @dosu.
While evaluating my dataset, I am encountering NaN values for the `faithfulness` and `answer_relevancy` metrics. However, I am successfully getting values for `context_precision` and `context_recall`. I have increased the run configuration values to ensure the process has sufficient retries and wait times:
```python
from ragas.run_config import RunConfig

run_config = RunConfig(
    max_retries=15,
    max_wait=120,
    thread_timeout=500,
    log_tenacity=True,
)
```
Output:

| question | ground_truth | answer | contexts | context_precision | context_recall | faithfulness | answer_relevancy |
|---|---|---|---|---|---|---|---|
| What are the major sources of carbohydrates in... | The traditional Hawaiian diet was rich in carb... | The majority of the diet was made up of these... | [['• Describe the different types of simple a... | 1.0 | 1.0 | NaN | 0.552834 |
| Why are carbohydrates considered essential for... | Carbohydrates are essential for the human body... | Carbohydrates provide a ready source of energ... | [['Protein Necessary for tissue formation, ce... | 1.0 | 1.0 | NaN | NaN |
Request: Please advise on resolving the NaN values for the `faithfulness` and `answer_relevancy` metrics.
The NaN values for the `faithfulness` and `answer_relevancy` metrics are likely due to issues in the statement generation or parsing process. Specifically, the `_create_statements_prompt` and `_create_nli_prompt` methods, or the LLM's responses to these prompts, might be causing the problem.

Here are a few steps to debug and resolve the issue (see the logging sketch after this list):

1. Check the LLM's response: ensure that the LLM is generating valid responses for the prompts created in `_create_statements_prompt` and `_create_nli_prompt`.
2. Verify statement parsing: ensure that the statements generated by the LLM are correctly parsed by `_statements_output_parser` and `_faithfulness_output_parser`.
3. Inspect the prompts: verify that the prompts created in `_create_statements_prompt` and `_create_nli_prompt` are correctly formatted and contain the necessary information.
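One way to surface the raw LLM responses while debugging is to turn up logging before running the evaluation. A minimal sketch using the standard library; the exact logger names ragas uses internally may differ:

```python
import logging

# Emit debug-level logs, including the tenacity retry messages produced
# when log_tenacity=True is set on the RunConfig
logging.basicConfig(level=logging.INFO)
logging.getLogger("ragas").setLevel(logging.DEBUG)
```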
Here is a modified version of the `_ascore` method with additional logging to help debug the issue:
```python
async def _ascore(
    self: t.Self, row: t.Dict, callbacks: Callbacks, is_async: bool
) -> float:
    """
    returns the NLI score for each (q, c, a) pair
    """
    assert self.llm is not None, "LLM is not set"

    # Create statements prompt
    p_value = self._create_statements_prompt(row)
    statements = await self.llm.generate(
        p_value,
        callbacks=callbacks,
        is_async=is_async,
    )
    statements = await _statements_output_parser.aparse(
        statements.generations[0][0].text, p_value, self.llm, self.max_retries
    )
    if statements is None:
        logger.warning("No statements generated by the LLM.")
        return np.nan

    statements = [item["simpler_statements"] for item in statements.dicts()]
    statements = [item for sublist in statements for item in sublist]
    assert isinstance(statements, t.List), "statements must be a list"

    # Create NLI prompt
    p_value = self._create_nli_prompt(row, statements)
    nli_result = await self.llm.generate(
        p_value,
        callbacks=callbacks,
        is_async=is_async,
        n=self._reproducibility,
    )
    nli_result_text = [
        nli_result.generations[0][i].text for i in range(self._reproducibility)
    ]
    faithfulness_list = [
        await _faithfulness_output_parser.aparse(
            text, p_value, self.llm, self.max_retries
        )
        for text in nli_result_text
    ]
    faithfulness_list = [
        faith.dicts() for faith in faithfulness_list if faith is not None
    ]

    if faithfulness_list:
        faithfulness_list = ensembler.from_discrete(
            faithfulness_list,
            "verdict",
        )
        faithfulness_list = StatementFaithfulnessAnswers.parse_obj(
            faithfulness_list
        )
    else:
        logger.warning("No faithfulness results generated by the LLM.")
        return np.nan

    return self._compute_score(faithfulness_list)
```
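To try the patched method without editing the installed package, one option is to override it on the metric's class at runtime. A rough sketch for debugging only, assuming the standard `faithfulness` instance exported by `ragas.metrics`:

```python
from ragas.metrics import faithfulness

# Replace the bound method so the logging-enabled _ascore defined above
# runs during evaluation (monkey-patch; remove once debugging is done)
type(faithfulness)._ascore = _ascore
```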
Additionally, for the `answer_relevancy` metric, ensure that the LLM is correctly generating questions and identifying noncommittal answers. Here is the relevant code snippet for `answer_relevancy`:
```python
@dataclass
class AnswerRelevancy(MetricWithLLM, MetricWithEmbeddings):
    name: str = "answer_relevancy"
    evaluation_mode: EvaluationMode = EvaluationMode.qac
    question_generation: Prompt = field(default_factory=lambda: QUESTION_GEN)
    strictness: int = 3

    def calculate_similarity(
        self: t.Self, question: str, generated_questions: list[str]
    ):
        assert self.embeddings is not None
        question_vec = np.asarray(self.embeddings.embed_query(question)).reshape(1, -1)
        gen_question_vec = np.asarray(
            self.embeddings.embed_documents(generated_questions)
        ).reshape(len(generated_questions), -1)
        norm = np.linalg.norm(gen_question_vec, axis=1) * np.linalg.norm(
            question_vec, axis=1
        )
        return (
            np.dot(gen_question_vec, question_vec.T).reshape(
                -1,
            )
            / norm
        )

    def _calculate_score(
        self, answers: t.Sequence[AnswerRelevanceClassification], row: t.Dict
    ) -> float:
        question = row["question"]
        gen_questions = [answer.question for answer in answers]
        committal = np.any([answer.noncommittal for answer in answers])
        if all(q == "" for q in gen_questions):
            logger.warning(
                "Invalid JSON response. Expected dictionary with key 'question'"
            )
            score = np.nan
        else:
            cosine_sim = self.calculate_similarity(question, gen_questions)
            score = cosine_sim.mean() * int(not committal)
        return score

    def _create_question_gen_prompt(self, row: t.Dict) -> PromptValue:
        ans, ctx = row["answer"], row["contexts"]
        return self.question_generation.format(answer=ans, context="\n".join(ctx))

    async def _ascore(self, row: t.Dict, callbacks: Callbacks, is_async: bool) -> float:
        assert self.llm is not None, "LLM is not set"
        prompt = self._create_question_gen_prompt(row)
        result = await self.llm.generate(
            prompt,
            n=self.strictness,
            callbacks=callbacks,
            is_async=is_async,
        )
        answers = [
            await _output_parser.aparse(result.text, prompt, self.llm)
            for result in result.generations[0]
        ]
        if any(answer is None for answer in answers):
            return np.nan
        answers = [answer for answer in answers if answer is not None]
        return self._calculate_score(answers, row)
```
This code includes additional logging to help identify where the process might be failing. If the LLM is not generating valid statements or NLI results, the logs will indicate this, allowing you to further investigate the issue [1][2].
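Since `answer_relevancy` multiplies a cosine similarity by the noncommittal flag, a zero or malformed embedding can also silently produce NaN. A quick sanity check, assuming `langchain_embeddings` follows the standard LangChain `Embeddings` interface:

```python
import numpy as np

# Embed a sample question and confirm the vector is well-formed and non-zero;
# a zero-norm vector would make the cosine-similarity division yield NaN
vec = np.asarray(langchain_embeddings.embed_query("What are carbohydrates?"))
print(vec.shape, np.linalg.norm(vec))
assert np.isfinite(vec).all() and np.linalg.norm(vec) > 0
```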
To continue talking to Dosu, mention @dosu.
@divrajput it is most likely due to the JSON-following capabilities of the model. Do you have any tracing tools set up?
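If no dedicated tracing tool is set up, LangChain's built-in debug mode is a quick way to see the raw prompts and completions, which makes malformed JSON from the model easy to spot. A minimal sketch:

```python
from langchain.globals import set_debug

# Print every prompt sent to the LLM and every raw completion it returns
set_debug(True)
```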
I encountered an issue while evaluating a dataset using the ragas library with the LangChain LLM and Sentence Transformer embeddings: the process throws an exception during execution.