elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Observability AI Assistant] Add recall evaluations to framework #179635

Open dgieselaar opened 7 months ago

dgieselaar commented 7 months ago

We are currently not evaluating our recall process in the evaluation framework. Our recall process involves both ELSER and the LLM, and we should add some kind of test to see how well this process works (both in isolation and in comparison to other models).
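As a rough illustration of the kind of check such a test could assert on (this is not tied to the evaluation framework's actual API; the document IDs and helper name below are made up), a recall-style score can be computed by comparing what the recall step returns against a set of known-relevant documents per test prompt:

import typing

def recall_at_k(retrieved_ids: typing.Sequence[str], relevant_ids: typing.Iterable[str], k: int) -> float:
    # Fraction of the known-relevant documents that appear in the top-k retrieved results.
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

# Example with made-up document IDs:
retrieved = ["doc-3", "doc-7", "doc-1", "doc-9"]
relevant = {"doc-1", "doc-3", "doc-5"}
print(recall_at_k(retrieved, relevant, k=4))  # 2 of 3 relevant docs found -> ~0.67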

marcogavaz commented 6 months ago

@dgieselaar I have done a couple of tests here with the Ranking Evaluation API, and I have written a function that computes an evaluation score for the quality of the ranking, given a set of scored documents.

The evaluation score for the quality of the ranking is the NDCG score. It is implemented in the Ranking Evaluation API, but we cannot use it directly, given that we use the LLM and not just a typical Elasticsearch query.
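For reference, this is roughly what a Ranking Evaluation API call with the DCG metric looks like when driven from Python against a running cluster (the index name, query, document IDs, and ratings below are placeholders). It evaluates the ranking produced by the embedded Elasticsearch query, which is why it cannot score a ranking that the LLM produced:

import requests

body = {
    "requests": [
        {
            "id": "kb_query_1",
            "request": {"query": {"match": {"text": "how do I configure alerting"}}},
            "ratings": [
                {"_index": "kb-docs", "_id": "doc-1", "rating": 3},
                {"_index": "kb-docs", "_id": "doc-2", "rating": 0},
            ],
        }
    ],
    "metric": {"dcg": {"k": 10, "normalize": True}},
}
# POST /<index>/_rank_eval returns an overall metric_score for the judged queries.
response = requests.post("http://localhost:9200/kb-docs/_rank_eval", json=body)
print(response.json()["metric_score"])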

What I think we could do is: given a set of pre-scored documents (that is, a set of documents for which we have relevance judgments with respect to the query), we ask the LLM to re-score them, and then we can apply something like:

import numpy as np

# Relevance judgments we know in advance, listed in the order the model ranked the documents.
relevance_scores = [3, 2, 3, 0, 1, 2]
p = 6  # number of ranking positions to evaluate (cut-off)

def dcg(relevance_scores, p):
    # Discounted cumulative gain over the top-p positions.
    relevance_scores = np.asarray(relevance_scores, dtype=float)[:p]
    return np.sum(
        (2 ** relevance_scores - 1)
        / np.log2(np.arange(2, relevance_scores.size + 2))
    )

# Normalize by the DCG of the ideal (descending) ordering to get NDCG in [0, 1].
ndcg_score = dcg(relevance_scores, p) / dcg(sorted(relevance_scores, reverse=True), p)
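For the sample judgments above this yields an NDCG of roughly 0.95; a perfect ordering of the same documents would yield exactly 1.0.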

The pro of this approach is that the score is bounded in [0, 1], so the scale is stable and we can compare scores across different LLMs. On the other hand, the score is not easily interpretable. @dgieselaar wdyt? Is this an approach we can introduce in the eval framework?
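To make the cross-model comparison concrete, here is a small self-contained sketch of how per-model NDCG scores could be collected (the model names and rankings are made up for illustration):

import numpy as np

# Hypothetical per-model rankings: for each model, the known relevance judgments
# listed in the order that model ranked the documents.
rankings_by_model = {
    "model_a": [3, 3, 2, 1, 0, 2],
    "model_b": [2, 3, 0, 3, 1, 2],
}

def ndcg(relevance_in_ranked_order, p):
    # NDCG = DCG of the observed ranking / DCG of the ideal (sorted) ranking.
    def dcg(scores):
        scores = np.asarray(scores, dtype=float)[:p]
        return np.sum((2 ** scores - 1) / np.log2(np.arange(2, scores.size + 2)))
    return dcg(relevance_in_ranked_order) / dcg(sorted(relevance_in_ranked_order, reverse=True))

for model, ranked_relevance in rankings_by_model.items():
    print(model, round(ndcg(ranked_relevance, p=6), 3))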

emma-raffenne commented 3 months ago

The recall function is being tested as part of the evaluation framework, see the KB scenario. On the other hand, since we are moving to the inference API, which will provide reranking, I don't think it's necessary to evaluate it on our side anymore.

@dgieselaar @grabowskit wdyt?
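For context, a rough sketch of the reranking mentioned above, as exposed by the Elasticsearch inference API's rerank task (the endpoint ID, URL, and documents below are placeholders, and the exact request shape should be checked against the inference API docs):

import requests

# Placeholder inference endpoint configured with a rerank-capable service.
url = "http://localhost:9200/_inference/rerank/my-rerank-endpoint"
body = {
    "query": "how do I configure alerting",
    "input": [
        "Alerting lets you define rules to detect conditions...",
        "Kibana dashboards visualize data stored in Elasticsearch...",
    ],
}
response = requests.post(url, json=body)
# Expected to return the inputs with relevance scores, reranked by the service.
print(response.json())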

dgieselaar commented 3 months ago

@emma-raffenne I think we should still do this, given that the timeline for moving to the inference API, and specifically the re-ranking part, is unclear.