getappmap / navie-benchmark

Navie benchmarks
MIT License

Propagate test failure messages from previously generated test case attempts to the code generator #58

Open kgilpin opened 2 months ago

kgilpin commented 2 months ago

The generate_patch method collects test errors when a test status is ERROR. These errors are then passed in to subsequent iterations of the code generator, to advise the LLM about how to avoid the errors.

Consider collecting additional information and propagating it to future solve attempts:

Here's an example in which the LLM just keeps making the same mistake, because we don't propagate the test errors:

scikit-learn__scikit-learn-13124.zip

github-actions[bot] commented 2 months ago

Title: Improve Propagation of Test Failure Information to Code Generator

Problem: The code generator's iterative loop does not carry enough information forward from earlier attempts. In particular, when a test case fails, its failure message is not passed to the next attempt, so the generator repeats the same mistakes and stalls, as the recurring errors in the patch attempts show.

Analysis: Propagating information from test failures would let the code generator adapt its approach based on past attempts. The generator should know which patches have already been tried and why they failed. In particular, test failure messages should be captured and used to direct the LLM away from similar errors. Knowing that a test which was expected to fail passed instead can also reveal logic errors in the generated code.

The main components that need consideration for capture and propagation include:

  1. Previous Patch Attempts: Keeping a record of all the patches applied so far so each new attempt can be checked against them to diversify strategies.
  2. Test Failure Messages: When a test fails, collect specific failure messages. This context allows the model to understand exactly what went wrong, such as exceptions raised, incorrect outputs, or unmet conditions.
  3. Pass/Fail Outcome Analysis: Determining whether the most recently attempted pass_to_fail test erroneously passed, and capturing any indication of why this occurred.
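
The three components above could be held in one record per attempt. The following is a minimal sketch, not the project's actual data model: `PatchAttempt`, `AttemptHistory`, `already_tried`, and the stand-in `TestStatus` enum are all hypothetical names chosen for illustration, and the real `TestStatus` may have different members.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class TestStatus(Enum):
    # Stand-in for the project's TestStatus; the real enum may differ.
    PASSED = "PASSED"
    FAILED = "FAILED"
    ERROR = "ERROR"


@dataclass
class PatchAttempt:
    """One solve attempt: the patch, its test outcome, and what went wrong."""
    patch: str
    status: TestStatus
    failure_messages: List[str] = field(default_factory=list)
    # True if a pass_to_fail test passed when it should have failed;
    # None if that check was not run for this attempt.
    pass_to_fail_passed: Optional[bool] = None


def _normalize(patch: str) -> str:
    # Ignore trailing whitespace so trivially different patches compare equal.
    return "\n".join(line.rstrip() for line in patch.strip().splitlines())


@dataclass
class AttemptHistory:
    """Accumulates attempts so new patches can be checked against old ones."""
    attempts: List[PatchAttempt] = field(default_factory=list)

    def record(self, attempt: PatchAttempt) -> None:
        self.attempts.append(attempt)

    def already_tried(self, patch: str) -> bool:
        normalized = _normalize(patch)
        return any(_normalize(a.patch) == normalized for a in self.attempts)
```

Comparing normalized text only catches near-identical patches; detecting patches that differ textually but repeat the same strategy would need something stronger.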

Proposed Changes:

    • Modify the method responsible for managing code patches within generate_patch to include keeping a history of attempted patches alongside their outcomes.
    • Add logic to store both the code attempts and associated test outcomes from each patch iteration.
    • Extend the error-collection logic to not only register when a TestStatus.ERROR occurs but also to gather and store specific test failure messages.
    • Extend the existing logging of TestStatus.PASSED or FAILED for pass_to_fail scenarios so that these outcomes are recorded in detail and fed back into the generator.
    • Modify the class GenerateTest to accommodate the history of patch attempts and to utilize the test failure messages directly for avoiding redundant errors.
    • Ensure the generate function uses the recorded test script errors to guide each new attempt, so that the iteration strategy changes in response to the documented prior failures.
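
As a rough illustration of the error-collection and feedback steps listed above, the sketch below gathers failure text for both ERROR and FAILED outcomes and renders it as prompt text. It is written under assumptions: `collect_feedback`, `format_feedback`, the string status values, and the 500-character limit are hypothetical and do not reflect the existing generate_patch code.

```python
from typing import List, Tuple

# String stand-ins for the project's TestStatus values (an assumption).
FEEDBACK_STATUSES = {"ERROR", "FAILED"}
# Arbitrary cap so a long log does not crowd out the rest of the prompt.
MAX_MESSAGE_CHARS = 500


def collect_feedback(status: str, test_output: str) -> List[str]:
    """Return trimmed failure text for ERROR and FAILED runs, else nothing."""
    if status not in FEEDBACK_STATUSES:
        return []
    text = test_output.strip()
    if not text:
        return []
    # Keep the tail of the output, where test runners usually print the exception.
    return [text[-MAX_MESSAGE_CHARS:]]


def format_feedback(attempts: List[Tuple[str, List[str]]]) -> str:
    """Render (status, messages) pairs from earlier attempts as prompt text."""
    lines = []
    for number, (status, messages) in enumerate(attempts, start=1):
        for message in messages:
            lines.append(f"Attempt {number} ({status}): {message}")
    if not lines:
        return ""
    return "Avoid repeating these earlier failures:\n" + "\n".join(lines)
```

For example, `format_feedback([("FAILED", ["AssertionError: 1 != 2"])])` yields a two-line string that could be appended to the next generation prompt, while a history with no failure messages yields an empty string and leaves the prompt unchanged.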

By implementing these changes, the generator will better leverage historical data from prior attempts, reducing the likelihood of iterating over the same mistakes and increasing the chances of generating successful patches quickly.