ComposioHQ / composio

Composio equips agents with well-crafted tools empowering them to tackle complex tasks
https://docs.composio.dev

Update benchmarking scaffold #256

Closed kaavee315 closed 3 days ago

kaavee315 commented 4 days ago

PR Type

Enhancement, Tests


Description


Changes walkthrough 📝

Relevant files

Enhancement

- **toolset.py** (`python/composio/tools/toolset.py`): Refactor workspace handling and logging in `ComposioToolSet`
  • Removed the `workspace_id` and `workspace_env` attributes from the constructor.
  • Added a `set_workspace_id` method to set and retrieve the workspace.
  • Adjusted logging and workspace retrieval logic.
  • +8/-4

- **run_evaluation.py** (`python/swe/benchmark/run_evaluation.py`): Refactor benchmark evaluation and logging setup
  • Added a `run_and_get_scores` function to run the agent and collect scores.
  • Refactored the `run` function to accept `agent_func` as a parameter.
  • Updated the logging setup to use `get_logger`.
  • Enhanced error handling and logging for patch retrieval.
  • +63/-13

Miscellaneous

- **benchmark.template** (`python/swe/composio_swe/scaffold/templates/crewai/benchmark.template`): Remove obsolete benchmark template
  • Removed the `benchmark.template` file.
  • +0/-5

Tests

- **run_benchmark.template** (`python/swe/composio_swe/scaffold/templates/crewai/run_benchmark.template`): Add new benchmark template for running evaluations (see the sketch after this walkthrough)
  • Added the new `run_benchmark.template` file.
  • Defined `agent_func` for workspace setup and issue kickoff.
  • Implemented a `main` function to run the benchmark.
  • +25/-0
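Taken together, the walkthrough describes the new scaffold flow: the generated template defines an `agent_func` and hands it to `run_and_get_scores`. The sketch below illustrates that flow; the `run_and_get_scores(agent_func, test_split, include_hints)` signature matches the diff in the code suggestions further down, but the import paths and the exact parameters of `agent_func` are illustrative assumptions, not taken from the PR.

```python
# Sketch of the run_benchmark flow described in the walkthrough above.
from composio import ComposioToolSet  # assumed public import path


def agent_func(workspace_id: str, issue_config) -> str:
    """Set up the workspace and kick off the agent on a single issue."""
    toolset = ComposioToolSet()
    toolset.set_workspace_id(workspace_id)  # method added by this PR
    # ... build a CrewAI crew with tools from `toolset`, then kick it off:
    # return crew.kickoff({"issue_id": issue_config.issue_id,
    #                      "issue": issue_config.issue_desc})
    return ""


def main() -> None:
    # Drives the benchmark over the chosen test split and scores the patches.
    from benchmark.run_evaluation import run_and_get_scores  # assumed path

    run_and_get_scores(agent_func, test_split="1:50", include_hints=True)


if __name__ == "__main__":
    main()
```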

💡 PR-Agent usage: Comment `/help` on the PR to get a list of all available PR-Agent tools and their descriptions


    codiumai-pr-agent-pro[bot] commented 4 days ago

PR Reviewer Guide 🔍

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Key issues to review

**Refactoring Concern:** The removal of `workspace_id` and `workspace_env` from the constructor in `toolset.py` and the addition of `set_workspace_id` might introduce issues if the workspace ID is used before being set in scenarios not covered by the PR (a sketch of a guard for this follows below).

**Error Handling:** The new error handling and logging enhancements in `run_evaluation.py` need to be carefully reviewed to ensure that they correctly capture and log errors without missing any exceptions.

**Functionality Change:** The change from `SHELL_EXECUTE_COMMAND` to `SHELL_EXEC_COMMAND` in `run_evaluation.py` could potentially alter the behavior of command execution, depending on the implementation details of these actions.
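To make the refactoring concern concrete, one common guard against use-before-set is to fall back to lazy creation behind a property. The following is a minimal sketch of that pattern; only `set_workspace_id` comes from the PR, while the class body, attribute names, and factory method are hypothetical.

```python
# Hedged sketch of a lazy-initialization guard for the workspace ID.
import typing as t


class ToolSetSketch:
    def __init__(self) -> None:
        # Workspace is no longer created eagerly in the constructor.
        self._workspace_id: t.Optional[str] = None

    def set_workspace_id(self, workspace_id: str) -> None:
        """Explicitly bind this toolset to an existing workspace."""
        self._workspace_id = workspace_id

    @property
    def workspace_id(self) -> str:
        # Lazily create a workspace on first access, so callers that never
        # call set_workspace_id still get a valid ID instead of None.
        if self._workspace_id is None:
            self._workspace_id = self._create_default_workspace()
        return self._workspace_id

    def _create_default_workspace(self) -> str:
        # Placeholder for whatever workspace factory the real code uses.
        return "workspace-default"
```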
    codiumai-pr-agent-pro[bot] commented 4 days ago

PR Code Suggestions ✨

Possible issue

**Add a check to ensure `logs_dir` is not `None` before using it**

Add a check to ensure `logs_dir` is not `None` before using it in the `run_and_get_scores` function to avoid potential `NoneType` errors.

[python/swe/benchmark/run_evaluation.py [206-212]](https://github.com/ComposioHQ/composio/pull/256/files#diff-2ca975beeb50842c50aec85982c675e2996103878ddf6f23ad806d5005559f60R206-R212)

```diff
+if logs_dir is None:
+    raise ValueError("logs_dir cannot be None")
 logger.info("Running agent with logs_dir: %s", logs_dir)
 run(
     agent_func=agent_func,
     test_split=test_split,
     include_hints=include_hints,
     logs_dir=logs_dir,
 )
```

Suggestion importance[1-10]: 8

Why: The suggestion to check if `logs_dir` is not `None` before using it in logging and function calls is crucial to prevent runtime errors, making it a significant improvement.
**Add error handling to catch and log exceptions during the kickoff process**

Add error handling in the `agent_func` to catch and log any exceptions that occur during the kickoff process.

[python/swe/composio_swe/scaffold/templates/crewai/run_benchmark.template [10-14]](https://github.com/ComposioHQ/composio/pull/256/files#diff-4fbbf5f40403805e0ab3f23c37618ee8aa6fee137ce0d760f5ffd6997d54d1c4R10-R14)

```diff
-return crew.kickoff(
-    {
-        "issue_id": issue_config.issue_id,
-        "issue": issue_config.issue_desc,
-    }
-)
+try:
+    return crew.kickoff(
+        {
+            "issue_id": issue_config.issue_id,
+            "issue": issue_config.issue_desc,
+        }
+    )
+except Exception as e:
+    logger.error(f"Error during kickoff: {e}")
+    raise
```

Suggestion importance[1-10]: 7

Why: Adding error handling in the `agent_func` to catch exceptions during the `kickoff` process is a good practice to ensure robustness and easier debugging.
Enhancement

**Add a type hint for the `agent_func` parameter to improve code clarity and type safety**

Add a type hint for the `agent_func` parameter in the `run_and_get_scores` function to improve code clarity and type safety.

[python/swe/benchmark/run_evaluation.py [204]](https://github.com/ComposioHQ/composio/pull/256/files#diff-2ca975beeb50842c50aec85982c675e2996103878ddf6f23ad806d5005559f60R204-R204)

```diff
-def run_and_get_scores(agent_func, test_split="1:50", include_hints=True):
+def run_and_get_scores(agent_func: t.Callable, test_split="1:50", include_hints=True):
```

Suggestion importance[1-10]: 7

Why: Adding a type hint for `agent_func` as `t.Callable` in the `run_and_get_scores` function enhances code clarity and helps with type checking, which is a good practice in Python.
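As a follow-up, if the callable's shape is stable across agents, the hint could be narrowed beyond a bare `t.Callable`. A sketch of that, where the parameter and return types are assumptions rather than the PR's actual signature:

```python
import typing as t

# Hypothetical narrowing: an agent function that takes an issue config and
# returns the agent's output as a string. Both types are assumptions.
AgentFunc = t.Callable[[t.Any], str]


def run_and_get_scores(
    agent_func: AgentFunc,
    test_split: str = "1:50",
    include_hints: bool = True,
) -> None:
    ...
```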
Best practice

**Use the `not` operator to check for `None` values in a more Pythonic way**

Instead of checking `workspace_id` for `None` directly, use the `not` operator for a more Pythonic approach.

[python/composio/tools/toolset.py [73]](https://github.com/ComposioHQ/composio/pull/256/files#diff-89b5c0aec15f01eb54ad26f78731b68f650e438a443378282b436df85abde51dR73-R73)

```diff
-if workspace_id is None:
+if not workspace_id:
```

Suggestion importance[1-10]: 6

Why: The suggestion to use `if not workspace_id:` instead of `if workspace_id is None:` is a valid Pythonic improvement for readability, but it's not crucial.
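One caveat worth flagging before applying this: `not workspace_id` and `workspace_id is None` differ for falsy non-`None` values such as an empty string, so the swap is only behavior-preserving if an empty ID should also take the fallback path. A quick illustration:

```python
for workspace_id in (None, ""):
    print(repr(workspace_id), workspace_id is None, not workspace_id)
# None True True   -> both checks treat None as unset
# ''   False True  -> only `not` also treats the empty string as unset
```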
    codiumai-pr-agent-pro[bot] commented 4 days ago

CI Failure Feedback 🧐

    (Checks updated until commit https://github.com/ComposioHQ/composio/commit/19fe7619349735c241ab738e716221daf1cb9307)

**Action:** test (ubuntu-latest, 3.10)
**Failed stage:** [Unittests](https://github.com/ComposioHQ/composio/actions/runs/9818250580/job/27110567188) [❌]
**Failed test name:** test_list_all
**Failure summary:** The action failed because the test `test_list_all` in `tests/test_cli/test_actions.py` failed with an AssertionError indicating a server error response:
  • The expected exit code was 0, but the actual exit code was 1.
  • The error message included an HTML response indicating a "503 Service Unavailable" error, suggesting that the application failed to respond.
  • Relevant error logs (trimmed; the raw dump interleaves the CSS and SVG of a Railway "503 Service Unavailable" error page):

```yaml
900: tests/test_cli/test_actions.py::TestListActions::test_list_all[arguments2-exptected_outputs2-unexptected_outputs2] FAILED [ 15%]
939: =================================== FAILURES ===================================
971:     result = self.run("actions", *arguments)
972: >   assert result.exit_code == 0, result.stderr
973: E   AssertionError: Error:
     # stderr contained a Railway 503 HTML error page:
     # "Nothing here... yet" / "Application failed to respond" / "Go to Railway"
1072: E   assert 1 == 0
1073: E   + where 1 = .exit_code
1074: tests/test_cli/test_actions.py:44: AssertionError
1226: FAILED tests/test_cli/test_actions.py::TestListActions::test_list_all[arguments2-exptected_outputs2-unexptected_outputs2] - AssertionError: Error:
1314: ============= 1 failed, 38 passed, 5 skipped, 1 warning in 22.91s ==============
1317: unittests: FAIL code 1 (48.55=setup[19.02]+cmd[5.65,23.88] seconds)
1318: evaluation failed :( (48.79 seconds)
1319: ##[error]Process completed with exit code 1.
```

✨ CI feedback usage guide:

The CI feedback tool (`/checks`) automatically triggers when a PR has a failed check. The tool analyzes the failed checks and provides several feedbacks:
- Failed stage
- Failed test name
- Failure summary
- Relevant error logs

In addition to being automatically triggered, the tool can also be invoked manually by commenting on a PR:
```
/checks "https://github.com/{repo_name}/actions/runs/{run_number}/job/{job_number}"
```
where `{repo_name}` is the name of the repository, `{run_number}` is the run number of the failed check, and `{job_number}` is the job number of the failed check.

#### Configuration options
- `enable_auto_checks_feedback` - if set to true, the tool will automatically provide feedback when a check fails. Default is true.
- `excluded_checks_list` - a list of checks to exclude from the feedback, for example: ["check1", "check2"]. Default is an empty list.
- `enable_help_text` - if set to true, the tool will provide a help message with the feedback. Default is true.
- `persistent_comment` - if set to true, the tool will overwrite a previous checks comment with the new feedback. Default is true.
- `final_update_message` - if `persistent_comment` is true and a previous checks message is updated, the tool will also create a new message: "Persistent checks updated to latest commit". Default is true.

See more information about the `checks` tool in the [docs](https://pr-agent-docs.codium.ai/tools/ci_feedback/).
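For reference, these options would normally live in the repository's PR-Agent configuration. The snippet below is a sketch only, assuming the options sit under a `[checks]` section of a `.pr_agent.toml`; verify the section name and file location against the linked docs for your PR-Agent version.

```toml
# Hedged sketch: checks-tool options from the guide above, with their
# documented defaults. Section name is an assumption to verify.
[checks]
enable_auto_checks_feedback = true  # auto-run on failed checks
excluded_checks_list = []           # e.g. ["lint", "docs"] to skip checks
enable_help_text = true             # append this usage guide to feedback
persistent_comment = true           # overwrite the previous checks comment
final_update_message = true         # post "Persistent checks updated ..."
```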