Open gutelius opened 1 year ago
Quick braindump, I just worked on evaluation strategy for my Meedan/Rockefeller thing.
Basically we crafted 20 test cases and then an editor reviewed them and gave gold standard / human ground truth for all 20.
At first I just used a spreadsheet and did it manually (a professional editor made the classification judgments).
(Screenshot shows a test of a classifier that responds yes or no and is sometimes incorrect, with interesting caveats about the surprising behavior of the LLM. The classification is "is this article an example of solutions journalism?", which is itself a specific evaluation standard.)
Then I wanted to automate the evaluation. I was able to set up Promptable and run an executable process that could be part of a build pipeline. Though it is still "just a spreadsheet" and all the evaluation is still human, I was basically editing JSON instead of Sheets.
Takeaway: We should definitely figure this out and do semi-automated evaluation, but just doing any regular, subjective check for overall quality, tone, etc. seems way more important at this step than having it be completely automated or even having a very large test suite. The interesting stuff is how the model might evade the question, or give too much detail, etc. It helps enumerate unexpected behaviors.
Defending a single overall metric does not realistically make the model better or prevent regressions unless you have specifically crafted a reliable test for that case.
I do think it is possible to design test cases that are more reliable, but it's going to be a flaky type of test that can't really block the build.
I think ideally we would have 100+ tests that would run in CI with reasonably deterministic output (less than 5% flakiness). Each test would give a specific input context and then ask a question, and the natural-language response would need to contain a certain character string (regex match), return a correctly-ranked search result, or show some other reliable indicator that it got the query correct (?) If we are trying to test nondeterministic natural language, I hope we don't have to resort to using another LLM call to interpret the result, because then we need a test for THAT model. Slippery stuff.
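For concreteness, a minimal sketch of what one of those deterministic checks could look like as a pytest test; `ask_pipeline` is a hypothetical entry point into our RAG code and the cases are placeholders, not real questions from the corpus:

```python
import re

import pytest

from our_pipeline import ask_pipeline  # hypothetical: question in, answer text out

# Each case: (question to ask, regex the natural-language answer must match)
CASES = [
    ("How many foo are in a baz?", r"\b3\b"),                       # placeholder
    ("Which method does the baz protocol describe?", r"(?i)foo"),   # placeholder
]

@pytest.mark.parametrize("question,expected", CASES)
def test_answer_contains_expected_string(question, expected):
    answer = ask_pipeline(question)
    assert re.search(expected, answer), f"expected /{expected}/ in: {answer!r}"
```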
Promptable shut down in July 🤦
It does seem like there is some potential to use a language model to assist with generating the initial dataset. For example, it could generate 10 questions that could be asked about a single PDF, structured as yes/no questions. It could propose answers and put the whole thing in a JSON file of question-answer pairs. Then we just execute it in a loop, correct any bad answers, and call that v1 of "SonoEval" (or whatever is client-specific). Over time we could keep adding to it to capture more weird edge cases and any errors that get flagged. I do think we should have some type of formal evaluation dataset that is basically branded and delivered to the client as a deliverable: their own little benchmark. It just seems like a great focal point for the project. We don't need to deliver usability reports in PowerPoint; we need to itemize real use cases, especially the ones that are not handled well.
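A sketch of that generation loop, using the 2023-era openai 0.x client; the prompt wording, file names, and `draft_qa_pairs` helper are all made up for illustration:

```python
import json

import openai  # 0.x-style API ("openai.ChatCompletion"); newer clients differ

PROMPT = """Read the document excerpt below and write 10 yes/no questions about it.
Answer each one yourself. Respond only with a JSON list of
{{"question": "...", "answer": "yes" or "no"}} objects.

---
{excerpt}
"""

def draft_qa_pairs(excerpt: str) -> list:
    """Ask the model to propose question-answer pairs for one document excerpt."""
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(excerpt=excerpt)}],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)  # may need cleanup if not valid JSON

# A human then corrects any bad pairs before this file becomes v1 of the eval set.
pairs = draft_qa_pairs(open("extracted_pdf_text.txt").read())
json.dump(pairs, open("sono_eval_v1.json", "w"), indent=2)
```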
We can evaluate inference duration (wall-clock performance) and cost of inference. WandB has graphs which just track GPU and memory usage, which seems smart and useful at the earliest stages. I can also imagine a cost evaluator (since we will know precisely how much the evaluation costs to run). There are probably other metrics like this that give more of an "information scent" than a "proper benchmark."
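A minimal sketch of that kind of "information scent" metric, assuming the pipeline call also returns token usage and using placeholder per-token prices:

```python
import time

# Rough per-1K-token prices; these change often, so treat as placeholders.
PRICE_PER_1K = {"prompt": 0.03, "completion": 0.06}

def timed_call(ask, question):
    """Wrap one pipeline call and record wall-clock duration plus token cost."""
    start = time.perf_counter()
    answer, usage = ask(question)  # assumes the pipeline returns (text, usage dict)
    elapsed = time.perf_counter() - start
    cost = (usage["prompt_tokens"] * PRICE_PER_1K["prompt"]
            + usage["completion_tokens"] * PRICE_PER_1K["completion"]) / 1000
    return {"answer": answer, "seconds": elapsed, "usd": cost}
```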
I guess this type of evaluation strategy is supplemental to other methods of getting user feedback during live user sessions. A "thumbs up/down" UI element could perhaps feed into our evaluation reporting, perhaps by just flagging cases that we might want to add to the test set.
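A sketch of how that feedback could flag candidate test cases; the function name and file are hypothetical, not an existing API:

```python
import json
from datetime import datetime, timezone

def record_feedback(question: str, answer: str, thumbs_up: bool,
                    path: str = "candidate_test_cases.jsonl") -> None:
    """Append thumbs-down interactions to a file we can mine for new test cases."""
    if thumbs_up:
        return  # only negative feedback gets flagged for review
    with open(path, "a") as f:
        f.write(json.dumps({
            "question": question,
            "answer": answer,
            "flagged_at": datetime.now(timezone.utc).isoformat(),
        }) + "\n")
```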
So, assume we have n PDFs and we want to test them every time we merge to main (probably n could be as small as 2). We could just evaluate answers against the discrete, correct answers we already have, like "how many foo in a baz" — probably an integer or boolean. We would have to confirm that they are not answerable by the model alone (e.g., the answer did not appear in its training set). We make, say, 50 of these and just ask them all, to test inference toward a deterministic answer which we know can be inferred from the PDFs. If we do enough of these we could do normal precision/recall/F1 charts?
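If the answers are booleans (or get mapped to booleans), the precision/recall/F1 part is standard scikit-learn; a sketch with made-up results:

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical results: ground-truth booleans vs. what the pipeline answered.
truth     = [True, False, True, True, False, True]
predicted = [True, False, False, True, True, True]

precision, recall, f1, _ = precision_recall_fscore_support(
    truth, predicted, average="binary"
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```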
We can ask it to generate code and then execute the code. For example, if we have a property graph, we could try to update it and compute it like Paul did. The OpenAI functions endpoint has a built-in feature that can test for "executability" like this, and I think this is how code generation models like Code Llama are tested: essentially generating their own unit tests, which have to pass.
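The simplest version of an executability check is just "does the generated snippet run without raising"; a sketch of that idea (in CI this should execute inside a sandbox or throwaway container, never on a dev machine):

```python
def passes_executability_check(generated_code: str) -> bool:
    """Crude executability test: does the generated snippet run without raising?"""
    scratch: dict = {}  # throwaway namespace for the generated code
    try:
        exec(compile(generated_code, "<generated>", "exec"), scratch)
        return True
    except Exception:
        return False
```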
Note there is a question closely related to evaluation, the core user journey — https://github.com/TheDataGuild/mind-palace/issues/11
[Merging in a thread from Slack here]
@Quantisan I realize you are asking about “benchmarking” and I am talking about “evaluation”
I have been thinking about the relationship between these terms, and the HF How-To Guide has these 3 high-level categories of metrics:
- Generic metrics, which can be applied to a variety of situations and datasets, such as precision and accuracy.
- Task-specific metrics, which are limited to a given task, such as Machine Translation (often evaluated using metrics BLEU or ROUGE) or Named Entity Recognition (often evaluated with seqeval).
- Dataset-specific metrics, which aim to measure model performance on specific benchmarks: for instance, the GLUE benchmark has a dedicated evaluation metric.
So for me that was helpful clarification of terminology. Benchmarks are one of three categories of evaluation. Benchmarks are the “dataset-specific” type. (My experience as UX researcher makes me want to also add "qualitative evaluation" to the list, but the main focus here is on automated evaluation.)
Based on what I am learning from the HF docs, it seems there are aspects of our pipeline that can be tested against existing metrics directly with no custom code, like NER. But these are probably the easiest to test for and the least useful, since they are already tested at the model level (assuming our RAG system will not cause regressions on NER benchmarks for the models we use).
More likely we want to test recognition of things that are in our custom corpus.
Also, I think we don't want to run a large benchmark, which would probably take forever to set up and run. For example, I'm looking at SuperGLUE and it is impressive but intimidating in scope. https://gluebenchmark.com/leaderboard
As I was saying to Paul this morning in Slack I think we just need a GitHub Action that asks a few questions and verifies the result is accurate.
@here what's out there in terms of tooling for Continuous Benchmarking for data science projects similar to Continuous Integration for software development? e.g. what's the current best practices? I'm thinking of just using CI (Github Actions) with pytest, where a branch of the tests are running benchmarks and printing some numbers for manual inspection instead.
I agreed with Paul's instinct to keep it simple:
Yes I would expect just a python script that executes the function in CI, can do anything idempotent it wants (black boxy) and a final/outer function just has to return true. So the build is always red or green as normal. But I think we probably don’t want to have to read or interpret results in CI, it should only be a boolean.
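Something like this as the outer script CI runs, where the exit code (0/1) is what makes the build green or red; `eval_suite.run_all_checks` is a hypothetical module of ours:

```python
#!/usr/bin/env python
"""Outer CI entry point: green only if every check passes."""
import sys

from eval_suite import run_all_checks  # hypothetical: returns a list of booleans

def main() -> bool:
    results = run_all_checks()
    print(f"{sum(results)}/{len(results)} checks passed")
    return all(results)

if __name__ == "__main__":
    sys.exit(0 if main() else 1)
```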
On reflection, just taking notes, it seems like the key test we need at first is a list of say 10 questions which cleverly exercise the RAG function.
I am seeing these phases of evaluation:
Now: x / n questions are answered correctly. For an initial test we can generate the questions ourselves, because we are the experts in describing what the system can do. At this stage we can track our questions in markdown and just review the 10 questions ourselves as a checklist; let's not get stuck on automation. We don't need to defend against regressions yet since we have nothing to regress; most of our learning will probably come from open-ended exploration.
This list of verifiable questions is very similar to the task analysis we need to do with the team. Of course, the tasks they do should inform what we provide, and these questions need to be transparent to Sonothera. As they find use cases the system cannot answer, we want to capture those with this type of basic accuracy metric. But on the other hand, they are depending on us to tell them what this system can do, so I think we should draft the initial question list ourselves.
After the initial prototyping phase — We add third-party evaluation tools to integrate a range of metrics like precision and F1 scores, and dimensions like reliability, coherence, and creativity, most of which have existing metric definitions (?)
Longer term — Automate evaluation more, grow the suite. Keep verifying new functionality and clarifying which use cases are known to fail. In this sense we want to identify the edge of what is possible / safe and then clearly indicate that to the users in the UI.
One thing I am getting my head around: correct answers would be nondeterministic, so if we want to automate this, we would probably need to regex for entity matches or make the output executable somehow. The questions would have to be designed to elicit an entity or terminology that can be reliably detected by "something dumber than an LLM." It has to output something whose correctness can be verified. (I think this will be interesting and mildly difficult to do in a way that is useful as a test; it will probably be easy to think of questions that always pass or always fail, but we want to find the tractable "edge" of functionality, where our changes sometimes cause failures if we do something wrong. We want tests that fail if we don't have the right documents loaded and parsed correctly. That is, stock GPT-4 should fail the entire suite.)
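One way to make the "stock GPT-4 should fail" property testable is to run the same suite against a no-retrieval baseline and require a large gap; a sketch, where `rag_ask`, `bare_llm_ask`, and `cases` are hypothetical fixtures/helpers:

```python
import re

def suite_score(ask_fn, cases) -> float:
    """Fraction of (question, expected_regex) cases the answer function gets right."""
    hits = sum(bool(re.search(pattern, ask_fn(question))) for question, pattern in cases)
    return hits / len(cases)

def test_suite_depends_on_our_corpus(rag_ask, bare_llm_ask, cases):
    # The full pipeline should clear the bar...
    assert suite_score(rag_ask, cases) >= 0.8
    # ...while the same questions without our documents should mostly fail.
    assert suite_score(bare_llm_ask, cases) <= 0.2
```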
This looks promising
https://github.com/explodinggradients/ragas
Here we assume that you already have your RAG pipeline ready. When it comes to RAG pipelines, there are mainly two parts - Retriever and generator. A change in any of these should also impact your pipelines' quality.
First, decide on one parameter that you're interested in adjusting. for example the number of retrieved documents, K. Collect a set of sample prompts (min 20) to form your test set. Run your pipeline using the test set before and after the change. Each time record the prompts with context and generated output. Run ragas evaluation for each of them to generate evaluation scores. Compare the scores and you will know how much the change has affected your pipelines' performance.
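A rough sketch of that before-and-after loop in code; the metric names and expected dataset columns follow the 2023 ragas docs and may have changed, and `run_pipeline` / `test_prompts` are our own hypothetical pieces:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

def score_run(records):
    """records: one dict per test prompt with "question", "contexts", "answer" keys."""
    return evaluate(Dataset.from_list(records), metrics=[faithfulness, answer_relevancy])

# run_pipeline(prompts, k) is our own (hypothetical) code producing those records.
scores_before = score_run(run_pipeline(test_prompts, k=3))   # before the change
scores_after = score_run(run_pipeline(test_prompts, k=10))   # after changing K
print(scores_before, scores_after)
```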
Apparently the Promptfoo effort is not shut down; they just archived one of their repos.
Just found this "factuality evaluation" section in their docs, and it explains how to test with question/answer pairs and an OpenAI model that does the evaluation.
https://github.com/openai/evals/blob/main/docs/build-eval.md
This document walks through the end-to-end process for building an eval, which is a dataset and a choice of eval class. The examples folder contains Jupyter notebooks that follow the steps below to build several academic evals, thus helping to illustrate the overall process.
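If we go that route, the samples file is just JSONL of chat-format inputs plus an ideal answer; a sketch (check the linked doc for the exact schema, and the question here is a placeholder):

```python
import json

# Each line of the samples .jsonl: an "input" message list and an "ideal" answer to grade against.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer using only the provided document."},
            {"role": "user", "content": "How many foo are in a baz?"},  # placeholder
        ],
        "ideal": "3",
    },
]

with open("sono_eval.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```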
TL;DR Many evaluations and automated interpretability rely on using multiple models to evaluate and interpret each other. One model is given full access to the text output of another model in OpenAI's automated interpretability and many model-graded evaluations. We inject text that directly addresses the evaluation model and observe a change in metrics like deception. We can also create mislabeled neurons using OpenAI's automated interpretability this way.
https://gpt-index.readthedocs.io/en/v0.6.33/how_to/evaluation/evaluation.html
LlamaIndex offers a few key modules for evaluating the quality of both Document retrieval and response synthesis. Here are some key questions for each component:
- Document retrieval: Are the sources relevant to the query?
- Response synthesis: Does the response match the retrieved context? Does it also match the query?
This guide describes how the evaluation components within LlamaIndex work. Note that our current evaluation modules do not require ground-truth labels. Evaluation can be done with some combination of the query, context, and response, combining these with LLM calls.
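A sketch of what that looks like per the linked v0.6.x docs (class names have been renamed in newer LlamaIndex releases; `service_context` and `query_engine` are assumed to already exist):

```python
from llama_index.evaluation import QueryResponseEvaluator, ResponseEvaluator  # v0.6.x paths

# Does the response match the retrieved context?
response_evaluator = ResponseEvaluator(service_context=service_context)
# Does the response match both the query and the retrieved context?
query_evaluator = QueryResponseEvaluator(service_context=service_context)

query = "How many foo are in a baz?"              # placeholder question
response = query_engine.query(query)              # existing RAG query engine
print(response_evaluator.evaluate(response))      # "YES" / "NO"
print(query_evaluator.evaluate(query, response))  # "YES" / "NO"
```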
https://github.com/anyscale/ray-summit-2023-training/blob/main/Ray-LlamaIndex/notebooks/02_evaluation.ipynb This is a useful step-by-step guide to evaluating a RAG workflow. The last part, on Evaluation without Golden Responses, overlaps with a couple of the concepts Chris shared but puts them into practice together.
I'm sort of imagining a qualitative process here rather than some sort of automated harness connected to one of the more generic NLP eval metrics. Those are great for LLM-level broad comparisons, but seem pretty useless for specialized use cases like ours.