Planning for smart utilization of 3.5

GPT 3.5 is so much cheaper (.002 vs .06 / k tokens), not to mention it it usually returns faster and is less throttled. Given that, it makes sense to always at least attempt to use GPT 3.5 first.

Given we are gonna try GPT 3.5 first, how do we determine when to fallback to GPT 4? 1) When compiling the prompt for our completion, if it leaves less than n tokens remaining for completion, where n is the smallest number we expect to possibly hold an expected completion. IMO 500 tokens is a reasonable amount to reserve. But that's a variable that could use empirical measurement. 2) When receiving the completion, we should prompt for the answer to be wrapped in some delimiters to detect if there were not enough tokens for GPT 3.5 to complete it's attempted answer. 3) When detecting if the completion resulted in a fix for the current error. It may however be worth retrying here while slowly ratcheting up temperature. Or feeding new error back in. Need a way to check if GPT is just introducing new errors that happen before the original error could happen.

Additionally, code should include future proofing for fallback to the 32K model using rules 1 & 2 (since it's not smarter, just bigger). Obviously disabled by flag. Similarly allow disabling of 4-8k using the same system.

biobootloader / wolverine

Planning for smart utilization of 3.5 #6