Open ccstan99 opened 1 year ago
It might be worthwhile to keep API functionality for automating tests and benchmarks of the quality of: (1) semantic search retrieval and (2) generated answers to standalone questions.
Ideally, we'd have a script that accepts a Google Sheet or text file containing a list of benchmark questions as input. The output would be written to a separate spreadsheet tab, or to a file in some format with a timestamp in the filename. The file should be easy to import into a spreadsheet with the following columns for each benchmark question:
If the output is CSV or TSV, make sure delimiters work and characters in text data are properly escaped. Essentially, we should be able to reconstruct the entire input to the LLM as well as its generated output.
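As a rough sketch of the escaping and timestamped-filename requirements above, Python's standard `csv` module can quote every field so that commas, quotes, tabs, and newlines inside prompts or answers survive a round trip. The column names, the `benchmark` filename prefix, and the function names here are placeholders for illustration, not the actual column list or an existing script.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical column names, used only for illustration.
COLUMNS = ["question", "context_blocks", "prompt", "generated_answer"]


def timestamped_filename(prefix="benchmark", now=None):
    """Build a filename such as benchmark_20230601_120000.csv."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.csv"


def write_results(rows, stream):
    """Write rows (dicts keyed by COLUMNS) as CSV, quoting every field."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)


def read_results(stream):
    """Read the CSV back into a list of dicts, e.g. to verify the round trip."""
    return list(csv.DictReader(stream))


if __name__ == "__main__":
    rows = [{
        "question": 'What is "t-AGI"?',
        "context_blocks": "block one, with a comma\nblock two on a new line",
        "prompt": "System text\tthen the question",
        "generated_answer": "An answer with \"quotes\" in it.",
    }]
    # newline="" lets the csv module control line endings itself.
    with open(timestamped_filename(), "w", newline="", encoding="utf-8") as f:
        write_results(rows, f)
```

Reading the file back with `read_results` and comparing it to the original rows is a cheap check that the full LLM input and output can be reconstructed from the file.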
For now, let's focus on developing a solid prompt for standalone questions and assume chat history is empty. When we start debugging conversations, we'll need to rephrase the query based on the chat history before fetching context blocks and also log that.
This spreadsheet with sample inputs and outputs can also be used as a central test "hub" where problematic questions can be added to the list of benchmark questions. New tabs can be added to track candidate prompts that are worth evaluating.
Prompt: Can you explain the t-AGI framework?
Generated output: an explainer of t-AGI as tool AGI
Expected output: Distillation of Clarifying and predicting AGI
I tried another time after clearing the cache.
Prompt: what is t-AGI?
Generated output: an explainer of t-AGI as task directed AGI
Expected output: Distillation of Clarifying and predicting AGI
@markovial The current chatbot uses the dataset from June 2022, and that article is from May 2023. We'll want to check again after we start using the new set of embeddings. Submitting an issue on our form https://bit.ly/stampy-chat-issues will automatically log it in the spreadsheet's reported_problems tab, which I've already done for this instance. I've also added it to the benchmark questions so we can continue to monitor this.
https://chat.stampy.ai/tester can be used to do this.