getappmap / navie-benchmark

Navie benchmarks
MIT License

nomerge: LLM compare #52

Open kgilpin opened 2 months ago

kgilpin commented 2 months ago

This branch is used for official LLM comparison data.

No changes should be applied to this branch other than fixes needed to get LLMs working.

Workflow updates and improvements should not be applied to this branch, because they would invalidate the comparisons.

Results spreadsheet: https://docs.google.com/spreadsheets/d/1GjOKDzVyrFN6rh_xIP96JaaJxXDOU2ok6a6Y1I3osbI/edit?gid=1776346381#gid=1776346381

Results

| Metric | sonnet-20240620 | gpt-4o-2024-08-06 |
| --- | --- | --- |
| Date | 2024-09-19 | 2024-09-19 |
| Resolved % | 26.3% | 32.3% |
| Code file match % | 53% | 61% |
| Test file match % | 23% | 23% |
| Average cost | $1.35 | $0.94 |
| Avg elapsed time (min) | 8.7 | 5.5 |
| Resolved (=2) | 42% | 38% |
| Resolved (=3) | 77% | 67% |
| Input cost per 1MM | $3.00 | $2.50 |
| Output cost per 1MM | $15.00 | $10.00 |
| Sent chars | 254,092,577 | 234,585,594 |
| Received chars | 12,274,574 | 7,067,731 |
| Total cost | $225.33 | $156.46 |
| Stddev elapsed time | 5.35 | 4.20 |
| Lint repair average | 4.08 | 2.41 |
| Test gen average | 5.06 | 5.11 |
| Test gen success average | 3.92 | 3.36 |
| Code gen average | 4.85 | 4.61 |
| Edit test file % | 54% | 58% |
| Test patch gen % | 54% | 58% |
| Inverted patch gen % | 46% | 52% |
| Pass to Pass % | 81% | 92% |
| Pass to Fail % | 28% | 32% |
| Fail to Pass % | 15% | 20% |
| Average score | 1.27 | 1.39 |
| Resolved count | 44 | 54 |
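For reference, here is a minimal sketch of how the total cost can be sanity-checked from the sent/received character counts and the per-1M-token rates in the table. The ~4 characters-per-token ratio is an assumption, not something measured in this benchmark, so the estimate will only land in the neighborhood of the reported totals:

```python
def estimate_cost(sent_chars: int, received_chars: int,
                  input_rate: float, output_rate: float,
                  chars_per_token: float = 4.0) -> float:
    """Rough cost estimate in dollars.

    input_rate / output_rate are $ per 1M tokens; chars_per_token
    is an assumed average (~4 for English text), not a measured value.
    """
    input_tokens = sent_chars / chars_per_token
    output_tokens = received_chars / chars_per_token
    return (input_tokens / 1_000_000) * input_rate + \
           (output_tokens / 1_000_000) * output_rate

# Sonnet column: 254,092,577 sent chars, 12,274,574 received chars,
# $3 per 1M input tokens, $15 per 1M output tokens.
est = estimate_cost(254_092_577, 12_274_574, 3.0, 15.0)
print(f"${est:.2f}")  # lands near the reported $225.33 total
```

The estimate for the sonnet column comes out around $237, in the same ballpark as the reported $225.33; the gap is expected since the true tokenizer ratio differs from the assumed 4 chars/token.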