mprast opened 3 weeks ago
I've added a harness that uses an LLM to generate user messages to xRx. Now you can have the surprisingly awkward experience of watching a robot talk to a robot while you mill around at your desk.
To try this out yourself:

1. `cd xrx-sample-apps/shopify-app`
2. `pip install -r reasoning/requirements.txt`
3. `pip install -r test/requirements.txt`
4. `python test/synthetic_user_test.py`
You must provide a goal to the synthetic user. The user will do its best to satisfy the goal by talking to the agent, stopping when it either 1) determines that it has completed its task or 2) loses confidence that the agent can help it achieve its goal.
To see an example of the first case, try: "Your goal is to order a Brooklyn-style pizza."
To see an example of the second case, try: "Your goal is to buy a wet pizza. It must be wet."
The synthetic user seems generally inclined to give the agent the benefit of the doubt, so in the second case it may take a while before it gets frustrated and gives up. The agent also gets confused and crashes with this prompt occasionally, so you may need to run it a few times to see the synthetic user quit.
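For intuition, here is a minimal sketch of the kind of loop such a harness runs. This is not the actual implementation in `test/synthetic_user_test.py`; the function names, the goal-passing mechanism, and the stop-signal format are all assumptions made for illustration.

```python
# Hedged sketch: an LLM role-plays the user until the goal is met or it gives up.
# `ask_llm` and `send_to_agent` are placeholders, not the real xRx interfaces.
from typing import Callable

def run_synthetic_user(
    goal: str,
    ask_llm: Callable[[str], str],        # e.g. a wrapper around your LLM provider
    send_to_agent: Callable[[str], str],  # e.g. a wrapper around the xRx reasoning API
    max_turns: int = 20,
) -> str:
    transcript = []
    for _ in range(max_turns):
        # The LLM plays the user: it sees the goal plus the conversation so far
        # and either produces the next user message or declares an outcome.
        prompt = (
            f"You are a user of a shopping assistant. Your goal: {goal}\n"
            "Conversation so far:\n" + "\n".join(transcript) + "\n"
            "Reply with your next message, or exactly DONE if the goal is met, "
            "or exactly GIVE_UP if you no longer believe the agent can help."
        )
        user_msg = ask_llm(prompt).strip()
        if user_msg in ("DONE", "GIVE_UP"):
            return user_msg
        transcript.append(f"user: {user_msg}")
        agent_msg = send_to_agent(user_msg)
        transcript.append(f"agent: {agent_msg}")
    return "MAX_TURNS"
```

The real harness presumably layers session handling, logging, and retries on top of something like this loop.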
## [Feature] Enhance Reasoning Agent and Expand Testing Suite in xRx Framework
### What is the feature?
This feature introduces significant enhancements to the reasoning agent within the xRx framework. It includes updates to the Docker configurations, executor logic, graph traversal mechanisms, and the addition of comprehensive testing suites. These changes aim to improve the agent's performance, maintainability, and reliability by incorporating advanced reasoning capabilities and extensive automated tests.
### What changed?
| File Name | Changes |
|---|---|
| `docker-compose.yaml` | • Context: Configures Docker services for the Shopify app. • Changes [EDIT]: Added new services and updated volume mappings to support the enhanced reasoning agent and testing frameworks. |
| `Dockerfile` | • Context: Builds the Docker image for the reasoning component. • Changes [EDIT]: Included additional dependencies and set up directories required for the new reasoning functionality. |
| `executor.py` | • Context: Handles the execution of reasoning tasks. • Changes [EDIT]: Refactored the executor logic to incorporate new observability features and improved error handling. Added methods for encoding and decoding session data (see the sketch below this table). |
| `base.py` | • Context: Defines the base graph structure for reasoning. • Changes [EDIT]: Modified the traversal method to include a new parameter for request-response nodes. Enhanced the node execution process with better logging and task tracking. |
| `main.py` | • Context: Manages the main graph traversal logic. • Changes [EDIT]: Updated the `agent_graph` function to handle additional parameters, improving integration with the request-response mechanism. |
| `buyingFlow.json` | • Context: Sample test case for the purchasing flow. • Changes [NEW]: Added a JSON file to test the buying flow for a Brooklyn-style pizza. |
| `buyingFlow_simple.json` | • Context: Simplified version of the buying flow test. • Changes [NEW]: Introduced a simplified test case to validate basic purchasing interactions. |
| `distraction.json` | • Context: Test case focusing on handling distractions. • Changes [NEW]: Added a JSON file to test the agent's ability to manage and recover from distracted user inputs. |
| `distraction_simple.json` | • Context: Simplified distraction-handling test. • Changes [NEW]: Introduced a basic test scenario to assess the agent's response to distractions. |
| `request_response.json` | • Context: Test case for request-response interactions. • Changes [NEW]: Added a JSON file to evaluate the agent's handling of direct user requests and appropriate responses. |
| `request_response_simple.json` | • Context: Simplified request-response test. • Changes [NEW]: Created a basic test scenario to verify straightforward request-response flows. |
| `buySomethingTest.yaml` | • Context: Configuration for the buying flow test. • Changes [NEW]: Added YAML configuration defining the user goal and audits for the buying-something test scenario. |
| `defaultAudits.yaml` | • Context: Default audit questions for various stages. • Changes [NEW]: Introduced a comprehensive set of audit questions to evaluate different stages of the reasoning process. |
| `requirements.txt` | • Context: Specifies Python dependencies. • Changes [EDIT]: Added new dependencies required for the updated reasoning agent and testing frameworks. |
| `interactive_test.py` | • Context: Script for interactive testing. • Changes [EDIT]: Enhanced the testing script to handle verbose session outputs and redact sensitive information for better readability. |
| `test_requirements.txt` | • Context: Dependencies for testing. • Changes [EDIT]: Included additional packages necessary for running the expanded test suites. |
| `synthetic_user_test.py` | • Context: Synthetic user interaction tests. • Changes [NEW]: Added extensive synthetic user tests to simulate real-world interactions and validate agent responses under various scenarios. |
| `xrx_agent_framework/__init__.py` | • Context: Initializes the xRx agent framework. • Changes [EDIT]: Updated initialization logic to incorporate new testing utilities and framework enhancements. |
| `utils/testing.py` | • Context: Utility functions for testing. • Changes [REMOVED]: Removed deprecated testing utilities no longer compatible with the enhanced test suite. |
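The table mentions new methods on the executor for encoding and decoding session data. I don't know the exact representation used in `executor.py`; the sketch below only illustrates the common pattern of round-tripping a JSON-serializable session dict through a base64 string, with the field names and transport format assumed.

```python
# Assumed pattern only: serialize a session dict to a base64 string and back.
# The real executor.py may use a different encoding or field layout.
import base64
import json
from typing import Any, Dict

def encode_session(session: Dict[str, Any]) -> str:
    """Pack a JSON-serializable session dict into a base64 string."""
    return base64.b64encode(json.dumps(session).encode("utf-8")).decode("ascii")

def decode_session(blob: str) -> Dict[str, Any]:
    """Recover the session dict from its base64 representation."""
    return json.loads(base64.b64decode(blob).decode("utf-8"))

# Round-trip example with made-up session fields.
original = {"cart_id": "abc123", "history": ["user: hi", "agent: hello"]}
assert decode_session(encode_session(original)) == original
```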
## XRX Tree Tester
This PR adds a library, TreeTestRunner, for instrumenting and evaluating the performance (re: correctness, not speed) of xRx in certain scenarios. It can be used to a) create a hierarchy of nodes representing the agent execution flow and b) run an LLM to answer a series of qualitative questions about each node to determine if the agent successfully completed the scenario. The idea is that by representing agent execution as a tree, we can get fine-grained information about the success or failure of each node, and about whether and how those nodes contributed to the (in)ability of the agent to successfully complete the scenario.
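To make the idea concrete, here is a rough sketch of how a tree-plus-audits runner could be wired up. This is not the TreeTestRunner API from this PR; the class shape, method names, and the `ask_llm` hook are invented for illustration.

```python
# Illustrative sketch only: a tree of execution nodes, each audited by an LLM.
# None of these names come from the actual TreeTestRunner in this PR.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class ExecutionNode:
    name: str                                        # e.g. "conversation" or "tool-call"
    record: str = ""                                 # what the agent did at this node
    audits: List[str] = field(default_factory=list)  # qualitative questions for the LLM
    children: List["ExecutionNode"] = field(default_factory=list)

def audit_tree(node: ExecutionNode, ask_llm: Callable[[str], str]) -> Dict[str, List[str]]:
    """Walk the tree and ask the LLM each audit question about each node."""
    results = {node.name: []}
    for question in node.audits:
        answer = ask_llm(f"Node '{node.name}' did: {node.record}\nQuestion: {question}")
        results[node.name].append(answer)
    for child in node.children:
        results.update(audit_tree(child, ask_llm))
    return results

# Example wiring with a stubbed LLM, just to show the shape of the output.
root = ExecutionNode(
    name="conversation",
    record="user asked for a Brooklyn-style pizza; agent added one to the cart",
    audits=["Did the agent make progress toward the user's goal?"],
    children=[ExecutionNode(name="tool-call", record="called add_to_cart",
                            audits=["Was this tool call appropriate?"])],
)
print(audit_tree(root, ask_llm=lambda prompt: "yes (stubbed answer)"))
```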
For demonstration purposes, I've used the TreeTestRunner to instrument `shopify-app`. TreeTestRunner supports arbitrary trees; the structure I've chosen is:

### Testing
To see TreeTestRunner in action, start shopify-app and interact with the app normally (I'd recommend using `interactive_test.py`). When you are finished, say or type 'stop'. Wait a bit for the tests to run (this may take a while, especially if you get rate-limited by the LLM provider). After the tests run, the frontend will crash, but you should see this in the docker-compose output. At this point you can find the output of your test in `shopify-app/testOutput`. There will be two files, one of the form `<name>.json` and one of the form `<name>-simple.json`. The first is raw output and the second is cleaned up a bit to be more pleasant to read. I recommend using a JSON prettifier to look at these.

The description of the tests to run is in `shopify-app/reasoning/app/treeTestConfig/buySomethingTest.yaml`. It includes the goal that the agent is supposed to fulfill, as well as a list of audits for every possible stage. The format should be self-explanatory, but just leave a comment if it's not and I can document it. Feel free to add config for your own scenarios; you can change the test being run by changing the pathname in line 40 of `executor.py`.
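For orientation only, here is a guess at the general shape of such a config, loaded in Python. The keys and stage names below are assumptions, not the actual contents of `buySomethingTest.yaml`.

```python
# Assumed shape only: the real buySomethingTest.yaml may use different keys.
import yaml  # pip install pyyaml

assumed_config = yaml.safe_load("""
goal: Order a Brooklyn-style pizza and check out.
audits:
  choose-tool:
    - Did the agent pick a tool that moves the user toward the goal?
  final-response:
    - Did the agent confirm the order clearly?
""")

# Iterate the assumed goal and per-stage audit questions.
print("Goal:", assumed_config["goal"])
for stage, questions in assumed_config["audits"].items():
    print(stage, "->", questions)
```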
You can find some sample outputs under `shopify-app/reasoning/app/agent/sample_tests`. These were all generated by me interacting with the agent.

### TODOs & Cleanup Work

### Next Steps