When running the evaluators over larger datasets, depending on the model, it is common to run into LLM errors where the output is not valid JSON. For example, while running the benchmark scripts over the ARAGOG dataset, there is always one row that returns invalid JSON, so every time I run the script I get a score report that is not very useful, such as:
| metrics | score |
| --- | --- |
| context_relevance | NaN |
In that case, the output of the LLM-based evaluation metric whenever there is an error is something like:
```python
{'statements': [], 'statement_scores': [], 'score': nan}
```
As a user, I would like to keep track of the errors that happened during evaluation, so ideally this should be surfaced as a flag, for example:
```python
{'statements': [], 'statement_scores': [], 'score': nan, 'error': True}
```
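A minimal sketch of how an evaluator could catch invalid JSON and surface it via an `error` flag instead of failing silently. `parse_evaluator_output` is a hypothetical helper for illustration, not part of the Haystack API:

```python
import json
import math

def parse_evaluator_output(raw_reply: str) -> dict:
    """Parse an LLM evaluator reply. On invalid or incomplete JSON,
    return a NaN score together with an explicit error flag.
    (Sketch only; not the actual Haystack implementation.)"""
    try:
        parsed = json.loads(raw_reply)
        statements = parsed["statements"]
        scores = parsed["statement_scores"]
        # Aggregate per-statement scores into one row-level score.
        score = sum(scores) / len(scores) if scores else math.nan
        return {"statements": statements, "statement_scores": scores,
                "score": score, "error": False}
    except (json.JSONDecodeError, KeyError, TypeError):
        # Invalid JSON or missing keys: keep the row, flag the error.
        return {"statements": [], "statement_scores": [],
                "score": math.nan, "error": True}
```

With this shape, downstream reporting can both skip errored rows when averaging and count them.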
Then, in the evaluation score report, we could return the mean of the scores while ignoring the errored rows:
| metrics | score | total_errors |
| --- | --- | --- |
| context_relevance | 0.9 | 1 |
Outcome
- Change the LLM-based evaluators (context relevancy and faithfulness) so they return an error flag.
- Change the LLM-based evaluators to return a score even if there are rows with `np.nan` (for example, a suggestion is to replace `np.mean` with `np.nanmean` here).
- Change the `score_report()` function of `EvaluationRunResult` to return the total number of errors.
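The second and third items can be sketched with NumPy: `np.nanmean` averages over the non-NaN entries, and counting NaNs yields the `total_errors` column. The score values below are illustrative, not real benchmark results:

```python
import numpy as np

# Per-row context_relevance scores; one row failed JSON parsing
# and was recorded as NaN (values are illustrative).
scores = np.array([1.0, 0.9, 0.8, np.nan])

# np.mean propagates the NaN, making the aggregate useless:
np.mean(scores)  # -> nan

# np.nanmean ignores NaN rows, so the report stays informative:
mean_score = np.nanmean(scores)              # 0.9
total_errors = int(np.isnan(scores).sum())   # 1
```

This reproduces the improved score report above: a valid mean plus an explicit error count.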