embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Flexible post-evaluation filtering #1410

Open gnatesan opened 22 hours ago

gnatesan commented 22 hours ago

I want to be able to perform post-evaluation query filtering after evaluating a model on a retrieval benchmark. In other words, after evaluation is run I want to select a subset of the test queries based on query length and look at the performance metrics for just that subset (e.g. queries of length 15-20). However, I want to do this after evaluation is completed so that I do not need to re-run evaluation every time I change the length range for the subset. How would I save the results of `results = evaluation.run(model, verbosity=2, eval_splits=["test"])` such that I can do this? And is this even possible?

KennethEnevoldsen commented 18 hours ago

You might want to take a look at saving retrieval task predictions
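Roughly something like this (a sketch; the task, model, and output folder are just placeholders, and the exact filename the predictions get saved under may differ by version):

```python
import mteb
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any retrieval task works here; NFCorpus and this model are just examples.
model = SentenceTransformer("all-MiniLM-L6-v2")
tasks = mteb.get_tasks(tasks=["NFCorpus"])
evaluation = MTEB(tasks=tasks)

# save_predictions=True writes the per-query document rankings
# (query_id -> {doc_id: score}) to a JSON file in output_folder,
# so they can be re-scored later without re-running the model.
evaluation.run(
    model,
    eval_splits=["test"],
    save_predictions=True,
    output_folder="results",
)
```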

@orionw might have additional pointers

orionw commented 15 hours ago

+1, that flag helps. I tend to use it a lot as well @gnatesan, so let me know if there are additional things that would be helpful. For example, I don't think we save out the qrels or query-specific scores, although we could add flags for those as well.
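For context, the post-hoc filtering could look roughly like the sketch below: load the saved predictions, keep only queries in the desired length range, and re-score with pytrec_eval (the scorer mteb's retrieval evaluator uses). The predictions filename, the `relevant_docs`/`queries` attribute access, and the length thresholds are assumptions to adapt to your setup and mteb version:

```python
import json

import mteb
import pytrec_eval

# Load the saved predictions; the filename pattern may differ by version,
# but the format is roughly {query_id: {doc_id: score}}.
with open("results/NFCorpus_default_predictions.json") as f:
    run = json.load(f)

# The qrels are not saved with the predictions, so reload them from the task.
# `queries` / `relevant_docs` keyed by split is what monolingual retrieval
# tasks use; double-check against the version you are running.
task = mteb.get_task("NFCorpus")
task.load_data()
queries = task.queries["test"]
qrels = task.relevant_docs["test"]

# Keep only queries within the desired length range (in words here).
keep = {qid for qid, text in queries.items() if 15 <= len(text.split()) <= 20}
run_subset = {qid: docs for qid, docs in run.items() if qid in keep}
qrels_subset = {qid: rels for qid, rels in qrels.items() if qid in keep}

# Re-score just the subset and average the per-query metrics.
evaluator = pytrec_eval.RelevanceEvaluator(qrels_subset, {"ndcg_cut.10", "recall.10"})
per_query = evaluator.evaluate(run_subset)
ndcg10 = sum(s["ndcg_cut_10"] for s in per_query.values()) / len(per_query)
print(f"nDCG@10 over {len(per_query)} filtered queries: {ndcg10:.4f}")
```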