Conduct systematic testing

ccstan99 commented 1 year ago

Develop a set of benchmark questions to evaluate the effectiveness of prompts and parameters.
Solicit feedback from editors and general users to stress-test conversational functionality.

ccstan99 commented 1 year ago

It might still be worthwhile to have API functionality for automating testing and benchmarking the quality of (1) semantic search retrieval and (2) generated answers to standalone questions.
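
For the retrieval half, one simple measure is whether the expected source shows up among the top results for each benchmark question. Below is a rough sketch of that idea, not the actual stampy-chat API: `retrieve` is a hypothetical callable standing in for whatever semantic search call gets exposed, and the `query` / `expected_source` keys are assumed names for the benchmark entries.

```python
def hit_at_k(retrieved_titles, expected_title, k=5):
    """Return True if the expected source is among the top-k retrieved titles."""
    wanted = expected_title.strip().lower()
    return any(t.strip().lower() == wanted for t in retrieved_titles[:k])


def hit_rate(cases, retrieve, k=5):
    """Fraction of benchmark cases whose expected source is retrieved in the top k.

    `retrieve` is a hypothetical function: query string -> ranked list of source titles.
    """
    if not cases:
        return 0.0
    hits = sum(
        hit_at_k(retrieve(case["query"]), case["expected_source"], k)
        for case in cases
    )
    return hits / len(cases)
```

Generated answers are harder to score automatically, so those would probably still need a human pass over the logged output.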

ccstan99 commented 1 year ago

Ideally, we would have a script that accepts a Google Sheet or text file containing a list of benchmark questions as input. The output would be written to a separate spreadsheet tab, or to a file with a timestamp in its name. The file should be easy to import into a spreadsheet, with the following columns for each benchmark question:

Query
Generated Response
Citations
Prompt Template
Source Block 1
Source Block 2
...
Source Block N

If the output is CSV or TSV, make sure delimiters work and characters in text data are properly escaped. Essentially, we should be able to reconstruct the entire input to the LLM as well as its generated output.
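
As a starting point, here is a minimal sketch of the output side using Python's standard `csv` module, which handles quoting of commas, quotes, and embedded newlines. The dictionary keys (`query`, `response`, `citations`, `prompt_template`, `source_blocks`) and the function names are assumptions for illustration, not existing stampy-chat code.

```python
import csv
import io
from datetime import datetime, timezone


def build_rows(results, max_blocks):
    """Flatten benchmark results into a header row plus one row per question."""
    header = ["Query", "Generated Response", "Citations", "Prompt Template"]
    header += [f"Source Block {i}" for i in range(1, max_blocks + 1)]
    rows = [header]
    for r in results:
        blocks = list(r["source_blocks"])[:max_blocks]
        blocks += [""] * (max_blocks - len(blocks))  # pad so every row has the same width
        rows.append(
            [r["query"], r["response"], r["citations"], r["prompt_template"], *blocks]
        )
    return rows


def to_csv_text(rows):
    """Serialize rows as CSV, quoting every field so delimiters and newlines survive."""
    buf = io.StringIO(newline="")
    csv.writer(buf, quoting=csv.QUOTE_ALL).writerows(rows)
    return buf.getvalue()


def output_filename(prefix="benchmark", now=None):
    """Build a timestamped filename, e.g. benchmark_20230102_030405.csv."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.csv"
```

To write the file, open it with `newline=""` and write `to_csv_text(rows)` to it. Reading the result back with `csv.reader` should give the original fields unchanged, which is a quick way to check the escaping.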

For now, let's focus on developing a solid prompt for standalone questions and assume chat history is empty. When we start debugging conversations, we'll need to rephrase the query based on the chat history before fetching context blocks and also log that.

This spreadsheet with sample inputs and outputs can also be used as a central test "hub" where problematic questions can be added to the list of benchmark questions. New tabs can be added to track candidate prompts that are worth evaluating.

markovial commented 1 year ago

Prompt: Can you explain the t-AGI framework?
Generated output: an explainer of t-AGI as tool AGI
Expected output: Distillation of Clarifying and predicting AGI

I tried again after clearing the cache.

Prompt: what is t-AGI?
Generated output: an explainer of t-AGI as task directed AGI
Expected output: Distillation of Clarifying and predicting AGI

ccstan99 commented 1 year ago

@markovial The current chatbot uses the dataset from June 2022. That article is from May 2023. We'll want to check again after we start using the new set of embeddings. Submitting an issue through our form at https://bit.ly/stampy-chat-issues automatically logs it in the spreadsheet's reported_problems tab, which I've already done for this instance. I've also added it to the benchmark questions so we can continue to monitor it.

mruwnik commented 1 year ago

https://chat.stampy.ai/tester can be used to do this

StampyAI / stampy-chat

Conduct systematic testing #6