getappmap / navie-benchmark

Navie benchmarks
MIT License

feat: o1 #56

Closed kgilpin closed 1 month ago

kgilpin commented 2 months ago

Benchmark with o1 and o1-mini.

TODO

- Comparison table

| Metric | GPT-4o | o1-mini | o1 |
|---|---|---|---|
| Resolved % | 48.0% | 20.0% | 32.0% |
| Code file match % | 48% | 56% | 48% |
| Test file match % | 40% | 16% | 44% |
| Average cost | $0.32 | $0.72 | $2.10 |
| Avg elapsed time (min) | 3.5 | 8.6 | 11.8 |
| Resolved (=2) | 75% | 0% | 33% |
| Resolved (=3) | 100% | 33% | 57% |
| Input cost per 1M tokens | $2.50 | $3.00 | $15.00 |
| Output cost per 1M tokens | $10.00 | $12.00 | $60.00 |
| Sent chars | 11,374,767 | 20,923,708 | 12,392,577 |
| Received chars | 508,934 | 1,034,709 | 577,764 |
| Total cost | $7.98 | $17.90 | $52.51 |
| Instances | 25 | 25 | 25 |
| Chars per token | 4.2 | 4.2 | 4.2 |
| Stddev elapsed time (min) | 2.95 | 4.13 | 5.93 |
| Lint repair average | 0.92 | 5.71 | 1.17 |
| Test gen average | 1.80 | 2.92 | 2.29 |
| Test gen success average | 1.47 | 2.44 | 1.94 |
| Code gen average | 2.44 | 2.17 | 2.04 |
| Edit test file % | 68% | 36% | 64% |
| Test patch gen % | 68% | 36% | 64% |
| Inverted patch gen % | 64% | 32% | 48% |
| Pass to Pass % | 100% | 92% | 88% |
| Pass to Fail % | 36% | 12% | 40% |
| Fail to Pass % | 20% | 12% | 36% |
| Average score | 1.44 | 1.08 | 1.63 |
| Resolved count | 12 | 5 | 8 |
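The "Total cost" rows follow directly from the other figures in the table: approximate token counts as characters divided by the 4.2 chars-per-token assumption, then apply the per-1M-token input and output prices. A minimal sketch of that arithmetic (the function name `total_cost` is illustrative, not from the benchmark code):

```python
def total_cost(sent_chars, received_chars, input_price, output_price,
               chars_per_token=4.2):
    """Estimate dollar cost from character counts.

    Prices are per 1M tokens; token counts are approximated as
    chars / chars_per_token (the table assumes 4.2 chars per token).
    """
    sent_tokens = sent_chars / chars_per_token
    received_tokens = received_chars / chars_per_token
    return (sent_tokens / 1e6 * input_price
            + received_tokens / 1e6 * output_price)

# Reproduce the GPT-4o row: 11,374,767 sent chars, 508,934 received chars,
# $2.50 / $10.00 per 1M tokens -> ~$7.98 total
print(round(total_cost(11_374_767, 508_934, 2.50, 10.00), 2))
```

Dividing total cost by the 25 instances gives the "Average cost" row (e.g. $7.98 / 25 ≈ $0.32 for GPT-4o).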