emrgnt-cmplxty / automata

Automata: A self-coding agent

Evals for Symbol Retrieval #295

Closed: voynow closed this issue 11 months ago

voynow commented 11 months ago

Based on automata/tests/eval/test_eval_code_writing.py. As part of the evaluation framework, there is a need for a new feature for evaluating code retrieval given a specific query. The goal is to assess the system's ability to fetch the most relevant symbol in response to a given query.

Requirements:

- Input Query: A string that represents the user's question or instruction, similar to the existing instructions input.
- Expected Symbol: The code that should ideally be retrieved in response to the input query. This serves a similar purpose to expected_actions.
- Quantification of Closeness: A mechanism to quantify how close the retrieved code is to the expected symbol. The exact methodology for this quantification is still to be determined; one baseline is sketched after this list.
- Data Source: The code should be retrieved from an existing .scip file (or similar sources if deemed necessary).
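
To make the input structure and a baseline closeness metric concrete, here is a minimal sketch. The names (SymbolRetrievalEvalCase, closeness) are hypothetical, and difflib's SequenceMatcher is only one candidate for the still-undecided quantification strategy; an embedding similarity could be swapped in instead:

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class SymbolRetrievalEvalCase:
    """One eval case: a natural-language query plus the symbol
    (source snippet) the retriever is expected to surface."""

    query: str
    expected_symbol: str


def closeness(retrieved: str, expected: str) -> float:
    """Score retrieved code against the expected symbol in [0, 1].

    SequenceMatcher is a pure string-level baseline; it should be
    replaced once a quantification strategy is settled.
    """
    return SequenceMatcher(None, retrieved, expected).ratio()


# Illustrative data only, not symbols from the actual codebase:
case = SymbolRetrievalEvalCase(
    query="How do I look up a symbol by name?",
    expected_symbol="def find_symbol(name: str) -> Symbol: ...",
)
print(closeness("def find_symbol(name): ...", case.expected_symbol))
```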

Proposed Steps:

1. Define the new input structure that will include the query and the expected symbol.
2. Develop an algorithm or utilize existing libraries to compare the retrieved code with the expected symbol.
3. Integrate this evaluation into the existing framework, ensuring compatibility with current components (a sketch follows this list).
4. Test the new feature with representative examples to ensure its functionality and accuracy.
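
Building on the earlier sketch, the integration in step 3 could look roughly like the following, where `retrieve` stands in for whatever retrieval system (e.g. a SymbolRank-backed lookup) is under test; the function name and the 0.8 threshold are illustrative assumptions, not parts of the existing framework:

```python
from typing import Callable


def run_retrieval_eval(
    cases: list[SymbolRetrievalEvalCase],
    retrieve: Callable[[str], str],
    threshold: float = 0.8,
) -> float:
    """Run each case through a retriever and report the pass rate.

    `retrieve` maps a query to the code it fetched; a case passes
    when that code scores above `threshold` against the expected
    symbol using the `closeness` metric sketched earlier.
    """
    passed = sum(
        closeness(retrieve(case.query), case.expected_symbol) >= threshold
        for case in cases
    )
    return passed / len(cases)
```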

Potential Challenges:

- Defining a robust method for quantifying the quality of retrieval
- Handling ambiguities or multiple valid answers to a single query (one option is sketched below)
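
For the multiple-valid-answers challenge, one option is to score the retrieved code against a set of acceptable symbols and keep the best match, reusing the hypothetical `closeness` from earlier:

```python
def closeness_any(retrieved: str, acceptable: list[str]) -> float:
    """Score against every acceptable symbol and keep the best
    match, so a query with several valid answers isn't penalized."""
    return max(closeness(retrieved, expected) for expected in acceptable)
```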

Next Steps:

- Make evals more flexible to handle extensions similar to the one described above
- Add an eval for quantifying vanilla code retrieval (retrieval-augmented generation, RAG) for SymbolRank benchmarking (a candidate metric is sketched below)
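
For the SymbolRank benchmarking item, a standard ranking metric such as mean reciprocal rank (MRR) could quantify vanilla retrieval. The sketch below assumes each query yields a ranked list of symbol names from the retriever:

```python
def mean_reciprocal_rank(
    ranked_results: list[list[str]], expected: list[str]
) -> float:
    """MRR over a batch of queries: the reciprocal of the 1-based
    rank at which the expected symbol first appears in each ranked
    result list (contributing 0 if it never appears)."""
    total = 0.0
    for results, target in zip(ranked_results, expected):
        for rank, symbol in enumerate(results, start=1):
            if symbol == target:
                total += 1.0 / rank
                break
    return total / len(expected)


# Expected symbol ranked 1st for query A, 2nd for query B:
print(mean_reciprocal_rank([["a", "b"], ["c", "b"]], ["a", "b"]))  # 0.75
```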

This issue serves as a baseline for discussions and iterations. Feedback, clarifications, and suggestions are highly encouraged to refine the requirements and implementation details.

emrgnt-cmplxty commented 11 months ago

Awesome, just seeing this - I will take a look and implement something here asap.

voynow commented 11 months ago

Cool, let me know if you have any feedback on this too. I was trying to be somewhat specific but did gloss over a ton of details.

emrgnt-cmplxty commented 11 months ago

addressed here - https://github.com/emrgnt-cmplxty/automata/compare/feature/extend-eval-abstraction?expand=1

emrgnt-cmplxty commented 11 months ago

key logic is in here - https://github.com/emrgnt-cmplxty/automata/blob/4c43d052f7a76a180769892475093ada82d0b050/automata/eval/base.py.

I've broken the "eval" into a further layer of abstraction so that we can implement AgentEval and ToolEval. ToolEval still needs some implementing; perhaps this is where you can hop in?
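
For readers who haven't opened the linked file, the split might look roughly like the sketch below. This is only an illustration of the AgentEval/ToolEval layering; the actual interfaces and method names live in automata/eval/base.py at the linked commit:

```python
from abc import ABC, abstractmethod


class Eval(ABC):
    """Illustrative shared interface; see automata/eval/base.py
    for the real abstraction."""

    @abstractmethod
    def evaluate(self, input_data: str, expected: str) -> float:
        """Return a score in [0, 1] for one eval case."""


class AgentEval(Eval):
    """Evaluates a full agent run against expected actions."""

    def evaluate(self, input_data: str, expected: str) -> float:
        # Placeholder: replay the agent on input_data and compare
        # its actions against the expected ones.
        raise NotImplementedError


class ToolEval(Eval):
    """Evaluates a single tool call, e.g. symbol retrieval, the
    piece this thread notes still needs implementing."""

    def evaluate(self, input_data: str, expected: str) -> float:
        # Placeholder: invoke the tool and score its output against
        # `expected`, e.g. with a closeness metric.
        raise NotImplementedError
```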