OpenAutoCoder / Agentless

Agentless🐱: an agentless approach to automatically solve software development problems
MIT License
719 stars, 86 forks

Reproducing Agentless-1.5 Results on SWE-bench Lite #39

Open GCVulnerability opened 2 days ago

GCVulnerability commented 2 days ago

Thanks for your work on Agentless. However, I can't reproduce the performance reported in the technical report using the code you provided. After generating all the files in the 'repair_sample_1' through 'repair_sample_4' folders, I could not find a way to generate the 'all_preds.jsonl' file from all 40 samples while keeping them in the four separate folders. So I merged the output sample files into one folder and renamed them 'output_0_normalized.jsonl' through 'output_39_normalized.jsonl'. After merging, I ran 'rerank.py' and generated 'all_preds.jsonl'.

Using the gpt-4o-08-06 model and following the instructions in 'readme_swebench.md', I only got a 26% pass rate (78/300) on SWE-bench Lite. Moreover, even when I use all the intermediate results you provided in release 1.5 and only run 'rerank.py', I still only achieve a pass rate of 29.67% (89/300).

Is the way I used the 40 samples across the 4 folders incorrect? And how can I reproduce, from your intermediate results, the 32% pass rate you submitted to SWE-bench?

brutalsavage commented 1 day ago

Hi @GCVulnerability

You should not merge the output sample files together. Instead, pass all four folders to the rerank script as a comma-separated list, like this:

python agentless/repair/rerank.py --patch_folder results/swe-bench-lite/repair_sample_1/,results/swe-bench-lite/repair_sample_2/,results/swe-bench-lite/repair_sample_3/,results/swe-bench-lite/repair_sample_4 \
                                  --num_samples 40 \
                                  --deduplicate \
                                  --regression \
                                  --reproduction
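If you would rather not type the four paths by hand, the comma-separated value for --patch_folder can be assembled in a loop. This is a minimal sketch, assuming the results/swe-bench-lite/repair_sample_N layout used in the command above; RESULTS_DIR and patch_folders are hypothetical names introduced here, not part of Agentless. It only builds the argument, and the rerank.py flags are copied unchanged from the command above.

```shell
#!/usr/bin/env bash
# Sketch (assumption): build the comma-separated --patch_folder value from the
# four repair_sample_N directories instead of merging their output files.
# RESULTS_DIR is a hypothetical variable; point it at your own results folder.
RESULTS_DIR="results/swe-bench-lite"

patch_folders=""
for i in 1 2 3 4; do
  # ${var:+word} expands to "word" only when var is non-empty,
  # so a comma is inserted between entries but not before the first one.
  patch_folders="${patch_folders:+${patch_folders},}${RESULTS_DIR}/repair_sample_${i}/"
done

echo "$patch_folders"

# Then pass the list unchanged to the rerank script:
# python agentless/repair/rerank.py --patch_folder "$patch_folders" \
#     --num_samples 40 --deduplicate --regression --reproduction
```

This sketch adds a trailing slash to every folder, while the command above omits it on the last one; I'm assuming rerank.py treats both forms the same.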