getappmap / navie-benchmark — Navie benchmarks (MIT License)
# feat: o1 (#56)

Closed — **kgilpin** closed this 1 month ago.

**kgilpin** commented 2 months ago:

Benchmark with o1 and o1-mini.
TODO

- [x] Need a final branch update of appmap-js for o1 support
## Comparison table

5% of the verified set (25 instances). Limits: 2,2,2. Tokens: 8k.

| Metric | GPT-4o | o1-mini | o1 |
|---|---|---|---|
| Resolved % | 48.0% | 20.0% | 32.0% |
| Code file match % | 48% | 56% | 48% |
| Test file match % | 40% | 16% | 44% |
| Average cost (per instance) | $0.32 | $0.72 | $2.10 |
| Avg elapsed time (min) | 3.5 | 8.6 | 11.8 |
| Resolved (=2) | 75% | 0% | 33% |
| Resolved (=3) | 100% | 33% | 57% |
| Input cost per 1M tokens | $2.50 | $3.00 | $15.00 |
| Output cost per 1M tokens | $10.00 | $12.00 | $60.00 |
| Sent chars | 11,374,767 | 20,923,708 | 12,392,577 |
| Received chars | 508,934 | 1,034,709 | 577,764 |
| Total cost | $7.98 | $17.90 | $52.51 |
| Instances | 25 | 25 | 25 |
| Chars per token | 4.2 | 4.2 | 4.2 |
| Stddev elapsed time | 2.95 | 4.13 | 5.93 |
| Lint repair average | 0.92 | 5.71 | 1.17 |
| Test gen average | 1.80 | 2.92 | 2.29 |
| Test gen success average | 1.47 | 2.44 | 1.94 |
| Code gen average | 2.44 | 2.17 | 2.04 |
| Edit test file % | 68% | 36% | 64% |
| Test patch gen % | 68% | 36% | 64% |
| Inverted patch gen % | 64% | 32% | 48% |
| Pass to Pass % | 100% | 92% | 88% |
| Pass to Fail % | 36% | 12% | 40% |
| Fail to Pass % | 20% | 12% | 36% |
| Average score | 1.44 | 1.08 | 1.63 |
| Resolved count | 12 | 5 | 8 |
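The "Total cost" row is consistent with converting the sent/received character counts to tokens at the tabulated 4.2 chars per token and applying the per-1M-token input/output prices. A minimal sketch of that arithmetic (the `total_cost` helper name is illustrative, not from the benchmark code):

```python
def total_cost(sent_chars, received_chars, chars_per_token,
               input_cost_per_1m, output_cost_per_1m):
    """Estimate run cost from character counts and per-1M-token pricing."""
    input_tokens = sent_chars / chars_per_token
    output_tokens = received_chars / chars_per_token
    return (input_tokens * input_cost_per_1m
            + output_tokens * output_cost_per_1m) / 1_000_000

# GPT-4o row from the table above
print(round(total_cost(11_374_767, 508_934, 4.2, 2.50, 10.00), 2))  # → 7.98
```

The same calculation reproduces the o1-mini ($17.90) and o1 ($52.51) totals, and dividing by the 25 instances gives the per-instance average cost row (e.g. $7.98 / 25 ≈ $0.32 for GPT-4o).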