getappmap / navie-benchmark

Navie benchmarks
MIT License

Test errors in `generate_code` are malformed #63

Open dustinbyrne opened 2 months ago

dustinbyrne commented 2 months ago

To find examples of this issue, look for instances of `<test-errors>` within `generate.input.txt` logs. It appears that `generate` is called with malformed `test_errors`.

The trajectories confirm the test errors returned from summarize_test_errors are valid.

E.g.:

## Preventing test errors

Generate code that avoids the following test errors:

<test-errors>
1
e
O
.
x
)
n
y
m
S
3
g
E
u
z
r
b
q
w
t
a
P
D

`
M
4
G
/
=
#
o
>
T
,
6
l
U
N
L
5
s
0
%
h
:
F
c
I
(
A
k
]
2
"
j
'
v
[
p
B
7
C
<
i
8
9
f

d
_
</test-errors>

github-actions[bot] commented 2 months ago

Title

Correct malformed test_errors in generate_code calls

Problem

The function generate_code is being called with malformed test_errors, causing the <test-errors> section in the generated input text (generate.input.txt) to be filled with isolated single characters, one per line, instead of readable error messages.

Analysis

The issue lies in the way test_errors is being passed around and utilized in the code base. The malformed output suggests that either:

  1. test_errors is garbled when it is generated or formatted, or
  2. the errors are collected or aggregated incorrectly from different sources before being passed to generate_code.

Given that the trajectories confirm the test errors returned from summarize_test_errors are valid, the problem likely exists somewhere after the summarize_test_errors function is called and before the generate_code function is invoked with these errors.
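
One mechanism that would produce this symptom, offered as a hypothesis rather than a confirmed diagnosis: if summarize_test_errors returns a single str (an assumption; its return type is not shown in this thread), then set.update iterates over that string and stores each distinct character as a separate element. The sketch below reproduces the effect with a made-up error message.

```python
# Hypothetical error summary; the real output of summarize_test_errors is not shown here.
summary = "AssertionError: expected 3, got 4"

# set.update() treats its argument as an iterable, so a str is split into characters.
split_errors = set()
split_errors.update(summary)

# set.add() stores the string as a single element.
whole_errors = set()
whole_errors.add(summary)

print(sorted(split_errors))  # distinct single characters
print(whole_errors)          # one element: the full message
```

Joining split_errors with newlines would give one character per line, in arbitrary order and without repeats, which appears consistent with the <test-errors> block in the report.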

Proposed Changes

  1. solver/workflow/summarize_test_errors.py: Ensure summarize_test_errors is returning a properly formatted list of error strings by logging its output for verification.

  2. solver/workflow/generate_and_validate_code.py: Modify the code that collects test_errors to ensure it aggregates the errors correctly. Add logging to verify the content of the test_errors set before passing it to generate_code.

  3. solver/workflow/solve_code.py: Ensure that the test_errors parameter passed to generate_code is properly formatted. Add verification to check the integrity and format of test_errors.
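
A minimal sketch of the aggregation step in item 2, assuming a summary may arrive either as one str or as an iterable of strings. The helper name add_error_summary is hypothetical and not part of the solver code base.

```python
from typing import Iterable, Set, Union


def add_error_summary(errors: Set[str], summary: Union[str, Iterable[str]]) -> None:
    """Add a summary to `errors` without splitting a str into characters."""
    if isinstance(summary, str):
        # A bare string is one error message, so add it whole.
        if summary.strip():
            errors.add(summary)
        return
    for item in summary:
        if not isinstance(item, str):
            raise TypeError(f"expected str, got {type(item).__name__}")
        # Skip empty or whitespace-only entries.
        if item.strip():
            errors.add(item)


test_errors: Set[str] = set()
add_error_summary(test_errors, "ImportError: no module named foo")
add_error_summary(test_errors, ["TypeError: bad operand", ""])
print(test_errors)
```

Checking for str first matters because a str is itself an iterable of one-character strings, so a generic "iterate and add" loop would reintroduce the splitting.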

Here are the suggested changes for each file component:

  1. solver/workflow/summarize_test_errors.py:

    • Verify the output of summarize_test_errors function to ensure it returns a correctly formatted string.
    • Add debugging logs to inspect the intermediate variables that play a role in formatting output.
  2. solver/workflow/generate_and_validate_code.py:

    • In the generate_and_validate_code function, ensure test_errors are aggregated correctly.
    • Add logging statements to inspect the collected test_errors before they are passed to the generate_code.

    Here is the section to focus on:

    accumulator.extend(results)
    return accumulator

    test_errors = set()

    def collect_errors(work_dir: WorkDir, run_test_result: RunTestResult):
        if run_test_result.test_status == TestStatus.ERROR:
            if run_test_result.test_output:
                # Note: set.update() iterates its argument, so a str passed
                # here would be added one character at a time.
                test_errors.update(
                    summarize_test_errors(work_dir, run_test_result.test_output)
                )
  3. solver/workflow/solve_code.py:

    • Review and update the generate_code method to ensure test_errors is properly formatted and evaluated before being utilized.
    • Add a check to ensure the test_errors list is well-formed before it is used in generating the code.

    Example section of code:

    def generate_code(
        self, work_dir: WorkDir, plan: str, test_errors: List[str]
    ) -> Optional[Patch]:
        self.clean_git_state()

        generator = GenerateCode(
            self.log,
            work_dir,
            self.trajectory_file,
            plan,
            self.python_version,
            self.limits.file_limit,
        )

        def generate(attempt, lint_errors: List[str]):
            # Log test_errors before they are used, so the record exists even if generation fails
            self.log("debug", f"Test errors passed to code generation: {test_errors}")
            code = generator.generate(attempt, lint_errors, test_errors)
            return generator.apply(attempt, code)

By verifying and correcting the test_errors at these points, you should be able to ensure that the generate_code calls receive properly formatted error messages, resulting in correctly generated input text.
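
The well-formedness check suggested for solve_code.py could look roughly like the following. This is a sketch under stated assumptions: check_test_errors is a hypothetical name, and the heuristic (reject a bare str, and reject a collection of several entries that are all single characters) is only a guess at what "well-formed" should mean here.

```python
from typing import Iterable, List


def check_test_errors(test_errors: Iterable[str]) -> List[str]:
    """Return test_errors as a list, raising ValueError if it looks malformed."""
    if isinstance(test_errors, str):
        raise ValueError("test_errors is a bare str, expected a collection of str")
    errors = list(test_errors)
    for item in errors:
        if not isinstance(item, str):
            raise ValueError(f"non-string entry in test_errors: {item!r}")
    # Several entries that are all one character long suggest a split string.
    if len(errors) > 1 and all(len(item) == 1 for item in errors):
        raise ValueError("test_errors looks like a string split into characters")
    return errors


print(check_test_errors(["AssertionError: 1 != 2"]))
try:
    check_test_errors(set("AssertionError"))
except ValueError as exc:
    print(f"rejected: {exc}")
```

Raising early, before the prompt is assembled, would turn a silently garbled <test-errors> section into a visible failure that points at the caller.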

kgilpin commented 1 month ago

I haven't been able to reproduce this :-(