getappmap / navie-benchmark — Navie benchmarks (MIT License)
# feat: o1 (#56)

Closed — **kgilpin** closed this 1 month ago.

**kgilpin** commented 2 months ago:

Benchmark with o1 and o1-mini.
TODO

- [x] Need a final branch update of appmap-js for o1 support
## Comparison table

5% of the verified set (25 instances). Limits: 2,2,2. Tokens: 8k.

| Metric | GPT-4o | o1-mini | o1 |
|---|---|---|---|
| Resolved % | 48.0% | 20.0% | 32.0% |
| Code file match % | 48% | 56% | 48% |
| Test file match % | 40% | 16% | 44% |
| Average cost (per instance) | $0.32 | $0.72 | $2.10 |
| Avg elapsed time (min) | 3.5 | 8.6 | 11.8 |
| Resolved (=2) | 75% | 0% | 33% |
| Resolved (=3) | 100% | 33% | 57% |
| Input cost per 1M tokens | $2.50 | $3.00 | $15.00 |
| Output cost per 1M tokens | $10.00 | $12.00 | $60.00 |
| Sent chars | 11,374,767 | 20,923,708 | 12,392,577 |
| Received chars | 508,934 | 1,034,709 | 577,764 |
| Total cost | $7.98 | $17.90 | $52.51 |
| Instances | 25 | 25 | 25 |
| Chars per token | 4.2 | 4.2 | 4.2 |
| Stddev elapsed time | 2.95 | 4.13 | 5.93 |
| Lint repair average | 0.92 | 5.71 | 1.17 |
| Test gen average | 1.80 | 2.92 | 2.29 |
| Test gen success average | 1.47 | 2.44 | 1.94 |
| Code gen average | 2.44 | 2.17 | 2.04 |
| Edit test file % | 68% | 36% | 64% |
| Test patch gen % | 68% | 36% | 64% |
| Inverted patch gen % | 64% | 32% | 48% |
| Pass to Pass % | 100% | 92% | 88% |
| Pass to Fail % | 36% | 12% | 40% |
| Fail to Pass % | 20% | 12% | 36% |
| Average score | 1.44 | 1.08 | 1.63 |
| Resolved count | 12 | 5 | 8 |
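The "Total cost" row is consistent with converting the sent/received character counts to tokens at the tabulated 4.2 chars per token and applying the per-1M-token input/output prices. A minimal sketch of that arithmetic (the `total_cost` helper name is illustrative, not from the benchmark code):

```python
def total_cost(sent_chars, received_chars, chars_per_token,
               input_cost_per_1m, output_cost_per_1m):
    """Estimate run cost from character counts and per-1M-token pricing."""
    input_tokens = sent_chars / chars_per_token
    output_tokens = received_chars / chars_per_token
    return (input_tokens * input_cost_per_1m
            + output_tokens * output_cost_per_1m) / 1_000_000

# GPT-4o row from the table above
print(round(total_cost(11_374_767, 508_934, 4.2, 2.50, 10.00), 2))  # → 7.98
```

The same calculation reproduces the o1-mini ($17.90) and o1 ($52.51) totals, and dividing by the 25 instances gives the per-instance average cost row (e.g. $7.98 / 25 ≈ $0.32 for GPT-4o).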