8090-inc / xrx-sample-apps

Sample web apps built with xRx-Core
https://8090-inc.github.io/xrx-core/
Apache License 2.0

[draft] initial tree test pr #30

Open mprast opened 3 weeks ago

mprast commented 3 weeks ago

XRX Tree Tester

This PR adds a library for instrumenting and evaluating the performance (re: correctness, not speed) of xRx in certain scenarios. The library, called TreeTestRunner, can be used to (a) create a hierarchy of nodes representing the agent execution flow and (b) run an LLM to answer a series of qualitative questions about each node to determine whether the agent successfully completed the scenario. The idea is that by representing agent execution as a tree, we get fine-grained information about the success or failure of each node, and about whether and how those nodes contributed to the agent's (in)ability to complete the scenario.

For demonstration purposes, I've used TreeTestRunner to instrument shopify-app. TreeTestRunner supports arbitrary trees; the structure I've chosen (sketched just after the list below) is:

- one root node at the top level, representing a conversation between the user and the agent
- many request-response nodes at the second level. Each of these nodes captures a message from the user to the agent and an associated response from the agent back to the user.
- many graph-specific nodes below each request-response node. Each of these corresponds to a specific reasoning task the agent undertakes while formulating a response to the user (`routing`, `chooseTool`, etc.)
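
As a purely illustrative sketch of that shape (the class and method names below are hypothetical, not the actual TreeTestRunner API from this PR):

```python
# Illustrative sketch only -- these names are hypothetical, not the
# real TreeTestRunner API.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str
    children: list["Node"] = field(default_factory=list)

    def add_child(self, child: "Node") -> "Node":
        self.children.append(child)
        return child

# Root node: the whole conversation between the user and the agent.
conversation = Node("conversation")

# One request-response node per user message / agent reply pair.
rr = conversation.add_child(Node("request-response"))

# Graph-specific nodes for each reasoning task under that exchange.
rr.add_child(Node("routing"))
rr.add_child(Node("chooseTool"))
```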

Testing
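
In outline, the flow looks something like this (the commands and paths here are my assumption based on the docker-compose setup described below, not verbatim from the PR):

```shell
cd shopify-app
docker-compose up --build      # start the app; keep this terminal open to watch the logs
python interactive_test.py     # in a second terminal, chat with the agent
```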

In more detail: to see TreeTestRunner in action, start shopify-app and interact with the app normally (I'd recommend using `interactive_test.py`). When you are finished, say or type 'stop', then wait for the tests to run (this may take a while, especially if you get rate-limited by the LLM provider). After the tests run, the frontend will crash, but you should see

```
xrx-reasoning     | 2024-11-09 21:53:19,146 INFO:Done writing file!
xrx-reasoning     | 2024-11-09 21:53:19,149 INFO:Done writing simple file!
```

in the docker-compose output. At this point you can find the output of your test in `shopify-app/testOutput`. There will be two files, one of the form `<name>.json` and one of the form `<name>-simple.json`. The first is the raw output; the second is cleaned up a bit to be more pleasant to read. I recommend using a JSON prettifier to look at these.
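
Python's built-in prettifier works fine for this, e.g. (the filename is a placeholder):

```shell
python -m json.tool shopify-app/testOutput/<name>-simple.json
```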

The description of the tests to run lives in `shopify-app/reasoning/app/treeTestConfig/buySomethingTest.yaml`. It includes the goal the agent is supposed to fulfill, as well as a list of audits for every possible stage. The format should be self-explanatory, but leave a comment if it's not and I'll document it. Feel free to add configs for your own scenarios; you can change which test runs by changing the pathname on line 40 of `executor.py`.
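
For reference, such a config might look roughly like this (the field names are illustrative guesses, not the actual schema):

```yaml
# Illustrative only -- the real buySomethingTest.yaml schema may differ.
goal: "Order a brooklyn-style pizza and complete the purchase."
audits:
  routing:
    - "Did the router choose an appropriate next step for the user's message?"
  chooseTool:
    - "Did the agent pick a tool suited to the user's request?"
```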

You can find some sample outputs under `shopify-app/reasoning/app/agent/sample_tests`. These were all generated by me interacting with the agent.

TODOs & Cleanup Work

Next Steps

mprast commented 2 weeks ago

Update: Synthetic User Harness

I've added a harness that uses an LLM to generate user messages to xRx. Now you can have the surprisingly awkward experience of watching a robot talk to a robot while you mill around at your desk.

Testing

To try this out yourself:

You must provide a goal to the synthetic user. The user will do its best to satisfy the goal by talking to the agent, stopping when it either 1) determines that it has completed its task or 2) loses confidence that the agent can help it achieve its goal.
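
In pseudocode, that control loop looks roughly like this (a sketch under assumptions; none of these names come from the PR, and `agent` and `llm` are stand-in objects):

```python
# Hypothetical sketch of the synthetic user's control loop described above.
def run_synthetic_user(goal: str, agent, llm, max_turns: int = 20) -> str:
    history: list[str] = []
    for _ in range(max_turns):
        user_msg = llm.next_user_message(goal, history)  # LLM role-plays the user
        reply = agent.respond(user_msg)
        history += [user_msg, reply]
        if llm.goal_satisfied(goal, history):
            return "completed"   # case 1: the user decides its task is done
        if not llm.still_confident(goal, history):
            return "gave_up"     # case 2: the user loses confidence in the agent
    return "timed_out"           # safety valve so two robots don't chat forever
```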

To see an example of the first case, try: "Your goal is to order a brooklyn-style pizza." To see an example of the second case, try: "Your goal is to buy a wet pizza. It must be wet."

The synthetic user seems generally inclined to give the agent the benefit of the doubt, so in the second case it may take a while before it gets frustrated and gives up. The agent also gets confused and crashes with this prompt occasionally, so you may need to run it a few times to see the synthetic user quit.

alessandro-neri commented 2 weeks ago

[Feature] Enhance Reasoning Agent and Expand Testing Suite in xRx Framework

1. Overview

2. Files Modified

| File | Context | Changes |
| --- | --- | --- |
| `docker-compose.yaml` | Configures Docker services for the Shopify app. | **[EDIT]** Added new services and updated volume mappings to support the enhanced reasoning agent and testing frameworks. |
| `Dockerfile` | Builds the Docker image for the reasoning component. | **[EDIT]** Included additional dependencies and set up directories required for the new reasoning functionality. |
| `executor.py` | Handles the execution of reasoning tasks. | **[EDIT]** Refactored the executor logic to incorporate new observability features and improved error handling. Added methods for encoding and decoding session data. |
| `base.py` | Defines the base graph structure for reasoning. | **[EDIT]** Modified the traversal method to include a new parameter for request-response nodes. Enhanced node execution with better logging and task tracking. |
| `main.py` | Manages the main graph traversal logic. | **[EDIT]** Updated the `agent_graph` function to handle additional parameters, improving integration with the request-response mechanism. |
| `buyingFlow.json` | Sample test case for the purchasing flow. | **[NEW]** Added a JSON file to test the buying-flow scenario for a Brooklyn-style pizza. |
| `buyingFlow_simple.json` | Simplified version of the buying flow test. | **[NEW]** Introduced a simplified test case to validate basic purchasing interactions. |
| `distraction.json` | Test case focusing on handling distractions. | **[NEW]** Added a JSON file to test the agent's ability to manage and recover from distracted user inputs. |
| `distraction_simple.json` | Simplified distraction-handling test. | **[NEW]** Introduced a basic scenario to assess the agent's response to distractions. |
| `request_response.json` | Test case for request-response interactions. | **[NEW]** Added a JSON file to evaluate the agent's handling of direct user requests and appropriate responses. |
| `request_response_simple.json` | Simplified request-response test. | **[NEW]** Created a basic scenario to verify straightforward request-response flows. |
| `buySomethingTest.yaml` | Configuration for the buying flow test. | **[NEW]** Added a YAML configuration defining the user goal and audits for the buy-something scenario. |
| `defaultAudits.yaml` | Default audit questions for various stages. | **[NEW]** Introduced a comprehensive set of audit questions to evaluate the stages of the reasoning process. |
| `requirements.txt` | Specifies Python dependencies. | **[EDIT]** Added new dependencies required for the updated reasoning agent and testing frameworks. |
| `interactive_test.py` | Script for interactive testing. | **[EDIT]** Enhanced the testing script to handle verbose session output and redact sensitive information for better readability. |
| `test_requirements.txt` | Dependencies for testing. | **[EDIT]** Included additional packages necessary for running the expanded test suites. |
| `synthetic_user_test.py` | Synthetic user interaction tests. | **[NEW]** Added extensive synthetic user tests to simulate real-world interactions and validate agent responses under various scenarios. |
| `xrx_agent_framework/__init__.py` | Initializes the xRx agent framework. | **[EDIT]** Updated initialization logic to incorporate new testing utilities and framework enhancements. |
| `utils/testing.py` | Utility functions for testing. | **[REMOVED]** Deprecated outdated testing utilities no longer compatible with the enhanced testing suite. |

3. Issues/Improvements

**Security.** Potential exposure of session data.
- **Concern**: Session data is encoded and decoded using base64, which does not provide encryption.
- **Mitigation needed**: Implement encryption for session data to protect sensitive information during transmission and storage.
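
As a sketch of that mitigation (an assumption on my part, using the `cryptography` package rather than anything in this PR):

```python
# One possible mitigation (not code from this PR): encrypt session
# payloads with a symmetric key instead of encoding them with base64.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # keep this in a secrets manager, not in the repo
fernet = Fernet(key)

token = fernet.encrypt(b'{"cart": ["brooklyn-style pizza"]}')  # safe to store/transmit
plaintext = fernet.decrypt(token)    # restores the original session payload
```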
**Performance.** Increased computational overhead due to enhanced reasoning.
- **Impact**: The refactored executor and graph traversal introduce additional processing steps, potentially affecting response times.
- **Optimization needed**: Profile the new reasoning processes to identify bottlenecks and optimize the code for better performance.
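
A minimal way to start profiling with the standard library, assuming a hypothetical `run_reasoning` entry point (the actual executor entry point isn't named here):

```python
# Hypothetical profiling harness; `run_reasoning` stands in for the
# real executor entry point.
import cProfile
import pstats

def run_reasoning() -> None:
    ...  # invoke the reasoning graph for one request here

profiler = cProfile.Profile()
profiler.enable()
run_reasoning()
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```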
**Maintainability.** Complexity added to the executor and graph components.
- **Concern**: The new methods and parameters in the executor and graph traversal increase code complexity.
- **Improvement needed**: Further modularize the components and add comprehensive documentation to aid future maintenance.

**Simplification.** Redundant testing scripts across different test files.
- **Opportunity**: Multiple test scripts handle similar scenarios across various JSON files.
- **Refactoring needed**: Consolidate similar test cases into shared testing modules to reduce redundancy and improve maintainability.