EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Integrate Semantic Answer Similarity (SAS) into the evaluation metrics. #1703

Open gonzalo-santamaria-iic opened 3 months ago

gonzalo-santamaria-iic commented 3 months ago

The Semantic Answer Similarity (SAS) metric (https://arxiv.org/abs/2108.06130) employs pretrained encoders to gauge the semantic similarity between two types of texts: predictions and references. It can be computed in several ways, for example with a bi-encoder (cosine similarity between the embeddings of the two texts) or with a cross-encoder (a similarity score predicted directly from the concatenated pair).
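
To make the two variants concrete, here is a rough, self-contained sketch using the sentence-transformers library; the model names are illustrative placeholders, not the encoders we actually use:

```python
# Sketch of the two SAS variants with sentence-transformers.
# Model names below are examples only.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

prediction = "The capital of Spain is Madrid."
reference = "Madrid is the capital city of Spain."

# Bi-encoder SAS: embed both texts independently, then take cosine similarity.
bi_encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
pred_emb, ref_emb = bi_encoder.encode([prediction, reference], convert_to_tensor=True)
bi_sas = util.cos_sim(pred_emb, ref_emb).item()

# Cross-encoder SAS: score the concatenated pair in a single forward pass.
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")
cross_sas = float(cross_encoder.predict([(prediction, reference)])[0])

print(f"bi-encoder SAS: {bi_sas:.3f}, cross-encoder SAS: {cross_sas:.3f}")
```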

Considering the following aspects:

  1. Complexity and heterogeneity of model responses: particularly relevant for models trained to serve as user assistants in instruction formats, where responses vary widely.
  2. Impact of token position on metrics like perplexity: semantically valid answers often register high perplexity simply because of where their tokens appear in the sequence.

It could be advantageous to offer the community a metric that compares texts at the semantic level, particularly for tasks that evaluate responses from models designed to be interactive, such as assistants.

At IIC, we are collaborating with Hugging Face and SomosNLP to create the first Spanish generative LLM leaderboard, using the lm-evaluation-harness library as the evaluation suite. The leaderboard will include QA tasks with long, complex answers evaluated with the SAS metric.

We believe the wider community could also benefit from this metric. If you think the proposal is useful, I would be delighted to open a Pull Request, following the documentation on adding new tasks and the task guide, to implement the Semantic Answer Similarity metric and enable evaluation tasks for complex, subjective text generation.

Congratulations on your work! :) We will follow with interest the progress of this project that is so useful for the open-source community.

haileyschoelkopf commented 2 months ago

Hi!

Thank you for your interest in the library and contributing!

We'd be glad to support this! Worth a discussion on the best way to implement it though--how large are the encoder models you will typically want to evaluate with?

The best current way to get this added to the library is to create a custom Filter subclass (https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/filters) which, when instantiated, loads your model, runs the SAS computation, and returns the scores; a custom metric could then simply pass those scores through, I think. If you have any questions about this, or other ideas, happy to answer them!
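
Very roughly, something along these lines--with the caveat that the base-class location, the `register_filter` hook, the shape of `resps`/`docs`, and the `"answer"` field name are assumptions to sanity-check against the current code, not a guaranteed API:

```python
# Rough sketch of a SAS-scoring filter. Import paths, the registration hook,
# and the structure of `resps`/`docs` are assumptions.
from sentence_transformers import CrossEncoder

from lm_eval.api.filter import Filter
from lm_eval.api.registry import register_filter


@register_filter("sas")
class SASFilter(Filter):
    def __init__(self, model_name_or_path="cross-encoder/stsb-roberta-base", **kwargs):
        super().__init__(**kwargs)
        # Load the encoder once, when the filter is instantiated.
        self.scorer = CrossEncoder(model_name_or_path)

    def apply(self, resps, docs):
        # `resps` is assumed to be a list (one entry per doc) of lists of
        # generations; `docs` is assumed to carry the reference answer under
        # a hypothetical "answer" key.
        filtered = []
        for doc_resps, doc in zip(resps, docs):
            references = [doc["answer"]] * len(doc_resps)
            scores = self.scorer.predict(list(zip(doc_resps, references)))
            filtered.append([float(s) for s in scores])
        return filtered
```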

Separately, we'd like to add LLM-as-a-judge support and will likely implement that as a 2-step process (first, create LM outputs and save them following the existing workflows, then run a secondary grading script--which could also implement this model-based evaluation metric).

gonzalo-santamaria-iic commented 2 months ago

Hi Hailey!

Thanks for the response and for your interest in our proposal. We truly appreciate it! :smile:

Regarding the size of the encoder models, they typically range from around 100 MB to 500 MB, and at most about 2.5 GB in 32-bit precision.

We've experimented with three different approaches to implementation:

  1. Initially, we considered creating a filter that transforms string_model_predictions into embeddings using the chosen encoder, and then adding two new metrics to lm_eval/api/metrics.py: cosine_similarity for bi-encoder SAS and sigmoid for cross-encoder SAS. However, this requires additional preprocessing of the dataset: computing the cosine distance also means transforming the string_references from doc_to_target into embeddings, and for cross-encoders the string_model_predictions and string_references have to be concatenated before the filter is applied.

  2. Create a filter that computes Semantic Answer Similarity (SAS) directly. This can be achieved by defining an apply(self, resps, docs) method in a convenient way, since string_model_predictions is represented by the resps variable and string_references can be extracted from docs, similar to how doc_to_target is defined within the .yaml file. After the filter we have float_scores plus the string_references (from doc_to_target). Since the task .yaml allows setting a constant value for doc_to_target, setting that value to 1 and calling a custom metric that multiplies the filtered prediction (now a score) by doc_to_target would solve it. We couldn't think of a better way to implement the SAS metric this way :worried:

  3. What about defining the SAS metric directly within lm_eval/api/metrics.py, without filters? We encountered challenges during pipeline execution, particularly repetitive loading of the encoder during the computation process, but that can be solved. We are also unsure whether additional variables, such as model_name_or_path, can be passed when registering a new metric. Given that some existing metrics already operate on two strings and return a numeric value, incorporating the SAS metric this way seems feasible; a minimal sketch follows this list.
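
To make option 3 concrete, here is a minimal sketch of the kind of function we have in mind; the references/predictions signature, the lru_cache-based caching, the default model name, and the omission of the registration decorator are illustrative assumptions, not the exact code in our fork:

```python
# Minimal sketch of option 3: a standalone metric function in the style of
# lm_eval/api/metrics.py. Signature, caching, and model name are assumptions;
# registration would need to follow the existing @register_metric pattern.
import functools

from sentence_transformers import CrossEncoder


@functools.lru_cache(maxsize=None)
def _get_sas_encoder(model_name_or_path):
    # Cache the encoder so it is loaded once per process rather than on
    # every metric call (the repetitive-loading problem mentioned above).
    return CrossEncoder(model_name_or_path)


def sas(references, predictions, model_name_or_path="cross-encoder/stsb-roberta-base"):
    """Cross-encoder Semantic Answer Similarity between one prediction and one reference."""
    encoder = _get_sas_encoder(model_name_or_path)
    score = encoder.predict([(predictions[0], references[0])])[0]
    return float(score)
```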

For the time being, we have chosen to pursue the third option within this fork, specifically in the "aquas" branch. We have defined a new task in this branch to compute both types of SAS. This task has been instrumental in our testing and has also helped us understand your library in more depth.

The implementation of the metrics can be found here. For example:

```bash
lm_eval --model hf --model_args pretrained=google/gemma-2b --tasks aquas --device cuda:0 --batch_size 1
```

Once again, thank you for your interest. We hope you find this feedback valuable and that it helps clarify the best approach for implementing this metric within the harness. :slightly_smiling_face: