getappmap / navie-benchmark

Navie benchmarks
MIT License

nomerge: LLM compare #52

Open kgilpin opened 2 months ago

kgilpin commented 2 months ago

This branch is used for official LLM comparison data.

No changes should be applied to this branch other than fixes needed to get LLMs working.

Workflow updates and improvements should not be applied to this branch, because they would invalidate the comparisons.

Results spreadsheet: https://docs.google.com/spreadsheets/d/1GjOKDzVyrFN6rh_xIP96JaaJxXDOU2ok6a6Y1I3osbI/edit?gid=1776346381#gid=1776346381

Results

| Metric | sonnet-20240620 | gpt-4o-2024-08-06 |
| --- | --- | --- |
| Date | 2024-09-19 | 2024-09-19 |
| Resolved % | 26.3% | 32.3% |
| Code file match % | 53% | 61% |
| Test file match % | 23% | 23% |
| Average cost | $1.35 | $0.94 |
| Avg elapsed time (min) | 8.7 | 5.5 |
| Resolved (=2) | 42% | 38% |
| Resolved (=3) | 77% | 67% |
| Input cost per 1MM | $3.00 | $2.50 |
| Output cost per 1MM | $15.00 | $10.00 |
| Sent chars | 254,092,577 | 234,585,594 |
| Received chars | 12,274,574 | 7,067,731 |
| Total cost | $225.33 | $156.46 |
| Stddev elapsed time | 5.35 | 4.20 |
| Lint repair average | 4.08 | 2.41 |
| Test gen average | 5.06 | 5.11 |
| Test gen success average | 3.92 | 3.36 |
| Code gen average | 4.85 | 4.61 |
| Edit test file % | 54% | 58% |
| Test patch gen % | 54% | 58% |
| Inverted patch gen % | 46% | 52% |
| Pass to Pass % | 81% | 92% |
| Pass to Fail % | 28% | 32% |
| Fail to Pass % | 15% | 20% |
| Average score | 1.27 | 1.39 |
| Resolved count | 44 | 54 |
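For reference, here is a minimal sketch of how the total cost can be sanity-checked from the sent/received character counts and the per-1M-token rates in the table. The ~4 characters-per-token ratio is an assumption, not something measured in this benchmark, so the estimate will only land in the neighborhood of the reported totals:

```python
def estimate_cost(sent_chars: int, received_chars: int,
                  input_rate: float, output_rate: float,
                  chars_per_token: float = 4.0) -> float:
    """Rough cost estimate in dollars.

    input_rate / output_rate are $ per 1M tokens; chars_per_token
    is an assumed average (~4 for English text), not a measured value.
    """
    input_tokens = sent_chars / chars_per_token
    output_tokens = received_chars / chars_per_token
    return (input_tokens / 1_000_000) * input_rate + \
           (output_tokens / 1_000_000) * output_rate

# Sonnet column: 254,092,577 sent chars, 12,274,574 received chars,
# $3 per 1M input tokens, $15 per 1M output tokens.
est = estimate_cost(254_092_577, 12_274_574, 3.0, 15.0)
print(f"${est:.2f}")  # lands near the reported $225.33 total
```

The estimate for the sonnet column comes out around $237, in the same ballpark as the reported $225.33; the gap is expected since the true tokenizer ratio differs from the assumed 4 chars/token.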