Open gonzalo-santamaria-iic opened 3 months ago
Hi!
Thank you for your interest in the library and contributing!
We'd be glad to support this! It's worth a discussion on the best way to implement it, though: how large are the encoder models you will typically want to evaluate with?
The best current way to get this added to the library is to create a custom Filter subclass (https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/filters) which, when instantiated, loads your model and handles running the SAS metric, returning the score; a custom metric would then report those scores, I think. If you have any questions about this, or other ideas, happy to answer them!
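To make the Filter-subclass idea concrete, here is a minimal standalone sketch of what such a filter could look like. This is not the harness's actual API beyond the `apply(self, resps, docs)` method mentioned in this thread; the class name, the `encode` callable, and the `reference_key` lookup are all hypothetical stand-ins (in practice `encode` would be a real pretrained sentence encoder loaded once in `__init__`):

```python
# Hypothetical sketch of a SAS filter. In the real harness this would
# subclass the Filter base class from lm_eval/filters; here it is kept
# self-contained so the scoring logic is visible.
import math


class SASFilter:
    """Replaces each model response with its semantic similarity
    to the reference extracted from the doc (bi-encoder variant)."""

    def __init__(self, encode, reference_key="answer"):
        # `encode` maps a string to an embedding (sequence of floats);
        # in practice the encoder model would be loaded once here.
        self.encode = encode
        self.reference_key = reference_key

    @staticmethod
    def _cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    def apply(self, resps, docs):
        # resps: one list of generated strings per doc;
        # docs: the original task documents holding the references.
        scores = []
        for resp_list, doc in zip(resps, docs):
            ref_emb = self.encode(doc[self.reference_key])
            scores.append(
                [self._cosine(self.encode(r), ref_emb) for r in resp_list]
            )
        return scores
```

A downstream metric would then consume these per-response floats instead of raw strings.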
Separately, we'd like to add LLM-as-a-judge support and will likely implement it as a two-step process: first, create LM outputs and save them following the existing workflows; then run a secondary grading script, which could also implement this model-based evaluation metric.
Hi Hailey!
Thanks for the response and your interest in our proposal. We truly appreciate it! :smile:
Regarding the size of the models: they typically range from around 100 MB to 500 MB, and at most 2.5 GB in 32-bit precision.
We've experimented with three different approaches to implementation:
Initially, we considered creating a filter to transform `string_model_predictions` into embeddings using the chosen encoder for computing the SAS. Following this, we aimed to integrate two new metrics into the `lm_eval/api/metrics.py` script: `cosine_similarity` for bi-encoder SAS and `sigmoid` for cross-encoder SAS. However, we encountered challenges, such as the need for additional preprocessing of the dataset: the cosine-distance computation requires transforming the `string_references` from `doc_to_target` into embeddings as well, and cross-encoders require concatenating `string_model_predictions` and `string_references` before the filter runs.
The second option is to create a filter that computes Semantic Answer Similarity (SAS) directly. We have seen that this can be achieved by defining an `apply(self, resps, docs)` method in a convenient way, since the `string_model_predictions` are represented by the `resps` variable and the `string_references` can be extracted from `docs`, similar to defining `doc_to_target` within the `.yaml` file. After the filter we have `float_scores` and the `string_references` (from the `doc_to_target` variable). We have seen that inside the task `.yaml` you can set a constant value for `doc_to_target`, so setting that value to 1 and calling a custom metric that multiplies `doc_to_text` (after the filter) and `doc_to_target` would solve it. We couldn't think of a better way to implement the SAS metric along these lines :worried:
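Under that constant-target trick, the custom metric itself would be trivial, something like this hypothetical sketch (the function name is illustrative):

```python
def sas_passthrough(prediction, target):
    """With doc_to_target fixed to the constant 1, the 'metric' just
    passes the filter's SAS score through: score * 1 == score."""
    return prediction * target
```

The real work happens in the filter; the metric only exists so the harness's scoring machinery has something to aggregate.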
The third option: what about defining the SAS metric directly within `lm_eval/api/metrics.py`, without filters? We encountered challenges during pipeline execution, particularly repetitive model loading during the computation process, but that can be solved. We are also unsure whether additional variables, such as `model_name_or_path`, could be incorporated when registering a new metric. Given the implementation of some existing metrics that operate on two strings and return a numeric value, incorporating the SAS metric seems feasible.
For the time being, we have chosen to pursue the third option within this fork, specifically in the `aquas` branch, where we have defined a new task that computes both types of SAS. This task has been instrumental in our testing and has also helped us understand your library in more depth.
The implementation of the metrics can be found here. For example:

```shell
lm_eval --model hf --model_args pretrained=google/gemma-2b --tasks aquas --device cuda:0 --batch_size 1
```
Once again, thank you for your interest. We hope this feedback is valuable and helps clarify the best approach for implementing this metric within the harness. :slightly_smiling_face:
The Semantic Answer Similarity (SAS) metric (https://arxiv.org/abs/2108.06130) employs pretrained encoders to gauge the semantic similarity between two texts: a model prediction and a reference. The metric can be computed in several ways, for example with a bi-encoder (cosine similarity between the two embeddings) or with a cross-encoder that scores the concatenated prediction-reference pair.
Considering the following aspects:

- It could be advantageous to offer the community a metric capable of comparing texts at the semantic level, particularly in tasks that evaluate responses from models designed to be interactive, such as assistants.
- At IIC, we are collaborating with Hugging Face and SomosNLP to create the first Spanish generative LLM leaderboard, using the lm-evaluation-harness library as the evaluation suite. The leaderboard will include QA tasks with long, complex answers evaluated using the SAS metric.
We believe the community could also benefit from this metric. If you think this is a useful proposal, I would be delighted to open a Pull Request, following the documentation on how to add new tasks and the task guide, to implement the Semantic Answer Similarity metric and enable the creation of complex subjective text-generation evaluation tasks.
Congratulations on your work! :) We will follow with interest the progress of this project that is so useful for the open-source community.