Test errors in `generate_code` are malformed

dustinbyrne commented 2 months ago

To find examples of this issue, look for instances of `<test-errors>` within the `generate.input.txt` logs. It appears that `generate` is being called with malformed `test_errors`.

The trajectories confirm the test errors returned from summarize_test_errors are valid.

E.g.:

## Preventing test errors

Generate code that avoids the following test errors:

<test-errors>
1
e
O
.
x
)
n
y
m
S
3
g
E
u
z
r
b
q
w
t
a
P
D

`
M
4
G
/
=
#
o
>
T
,
6
l
U
N
L
5
s
0
%
h
:
F
c
I
(
A
k
]
2
"
j
'
v
[
p
B
7
C
<
i
8
9
f

d
_
</test-errors>

github-actions[bot] commented 2 months ago

Title

Correct malformed test_errors in generate_code calls

Problem

The function generate_code is being called with malformed test_errors, so the <test-errors> section of the generated input text (generate.input.txt) contains single characters and symbols, one per line, instead of error messages.

Analysis

The issue lies in how test_errors is passed around and used in the code base. The malformed output suggests that either:

- test_errors is garbled when it is generated or formatted, or
- test_errors is collected or aggregated incorrectly from different sources before being passed to generate_code.

Given that the trajectories confirm the test errors returned from summarize_test_errors are valid, the problem likely exists somewhere after the summarize_test_errors function is called and before the generate_code function is invoked with these errors.

Proposed Changes

- solver/workflow/summarize_test_errors.py: Ensure summarize_test_errors is returning a properly formatted list of error strings by logging its output for verification.
- solver/workflow/generate_and_validate_code.py: Modify the code that collects test_errors to ensure it aggregates the errors correctly. Add logging to verify the content of the test_errors set before passing it to generate_code.
- solver/workflow/solve_code.py: Ensure that the test_errors parameter passed to generate_code is properly formatted. Add verification to check the integrity and format of test_errors.

Here are the suggested changes for each file component:

solver/workflow/summarize_test_errors.py:
- Verify the output of summarize_test_errors function to ensure it returns a correctly formatted string.
- Add debugging logs to inspect the intermediate variables that play a role in formatting output.
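
The logging suggested above could be sketched as a small wrapper that records the type and length of whatever a function returns, since a `str` versus a list of `str` is exactly the kind of difference that is easy to miss in a log. The names `describe_result` and `logged` are hypothetical and not part of the solver code base; the real logger and call sites may differ.

```python
import functools
import logging
from typing import Any, Callable

logger = logging.getLogger("summarize_test_errors")


def describe_result(result: Any) -> str:
    """Describe a return value's type and size for a debug log line."""
    size = len(result) if hasattr(result, "__len__") else "n/a"
    return f"type={type(result).__name__} len={size} value={result!r}"


def logged(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap func so that each call logs a description of its return value."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        result = func(*args, **kwargs)
        logger.debug("%s returned %s", func.__name__, describe_result(result))
        return result

    return wrapper
```

Applying `logged` to the summarizer (as a decorator or at the call site) would make it visible whether the value is one string or a collection of strings.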

solver/workflow/generate_and_validate_code.py:

- In the generate_and_validate_code function, ensure test_errors are aggregated correctly.
- Add logging statements to inspect the collected test_errors before they are passed to generate_code.

Here is the section to focus on:

accumulator.extend(results)
return accumulator

test_errors = set()

def collect_errors(work_dir: WorkDir, run_test_result: RunTestResult):
    if run_test_result.test_status == TestStatus.ERROR:
        if run_test_result.test_output:
            test_errors.update(
                summarize_test_errors(work_dir, run_test_result.test_output)
            )
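
One possible cause worth checking in this excerpt, offered as a hypothesis rather than a confirmed diagnosis: if summarize_test_errors returns a single `str` rather than a list of strings, `set.update` iterates over that string and stores each character as its own element. That would match the `<test-errors>` sample above, which contains one character per line with no duplicates. The sketch below uses a made-up error message to show the difference between `update` and `add`.

```python
# Assumption: the summarizer returns one plain string, not a list of strings.
summary = "AssertionError: expected 3, got 4"

errors_by_update = set()
errors_by_update.update(summary)  # iterates the string, adding each character

errors_by_add = set()
errors_by_add.add(summary)  # adds the whole string as a single element

print(sorted(errors_by_update))
print(errors_by_add)
```

If the return type does turn out to be `str`, calling `test_errors.add(...)` (or wrapping the result in a list before `update`) would keep each summary intact.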

solver/workflow/solve_code.py:

- Review and update the generate_code method to ensure test_errors is properly formatted and evaluated before being used.
- Add a check to ensure the test_errors list is well-formed before it is used in generating the code.

Example section of code:

def generate_code(
    self, work_dir: WorkDir, plan: str, test_errors: List[str]
) -> Optional[Patch]:
    self.clean_git_state()

    generator = GenerateCode(
        self.log,
        work_dir,
        self.trajectory_file,
        plan,
        self.python_version,
        self.limits.file_limit,
    )

    def generate(attempt, lint_errors: List[str]):
        code = generator.generate(attempt, lint_errors, test_errors)
        # Add logging to inspect test_errors here
        self.log("debug", f"Test errors passed to code generation: {test_errors}")
        return generator.apply(attempt, code)
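
The well-formedness check mentioned above could look something like the following. This is a minimal sketch under assumptions: `check_test_errors` is a hypothetical helper, not an existing function in the solver, and the "all entries are one character long" heuristic is only a guess at what malformed input looks like, based on the sample in this issue.

```python
from typing import Iterable, List


def check_test_errors(test_errors: Iterable[str]) -> List[str]:
    """Return test_errors as a list, raising if it looks malformed.

    Hypothetical helper, not part of the solver code base.
    """
    if isinstance(test_errors, str):
        raise TypeError("test_errors must be a collection of strings, not a single str")

    errors = list(test_errors)
    for error in errors:
        if not isinstance(error, str):
            raise TypeError(f"test error is not a string: {error!r}")

    # Several entries that are all at most one character long suggest that
    # a string was split into characters somewhere upstream.
    if len(errors) > 1 and all(len(error) <= 1 for error in errors):
        raise ValueError("test_errors looks like a string split into characters")

    return errors
```

Calling such a check at the top of generate_code would make the failure show up where the bad value arrives, rather than later in generate.input.txt.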

By verifying and correcting the test_errors at these points, you should be able to ensure that the generate_code calls receive properly formatted error messages, resulting in correctly generated input text.

kgilpin commented 1 month ago

I haven't been able to reproduce this :-(

getappmap / navie-benchmark