CiberNin opened 1 year ago
Good ideas! I've done some limited experimentation using 3.5-turbo with wolverine (it's now added as a flag). It sometimes works, but sometimes fails to return valid JSON. A quick optimization might be to try a few iterations with 3.5 and, if it keeps returning invalid JSON, automatically retry with 4.
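A rough sketch of that retry-then-escalate loop (the helper names `ask_model` and `try_parse_json` are placeholders, not wolverine's actual API; `ask_model(model, prompt)` stands in for whatever function makes the chat completion call):

```python
import json

def try_parse_json(text):
    """Return parsed JSON, or None if the model's reply isn't valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

def get_fix(prompt, ask_model, max_35_attempts=3):
    """Try gpt-3.5-turbo a few times; if it keeps returning invalid
    JSON, fall back to gpt-4 for one final attempt."""
    for _ in range(max_35_attempts):
        parsed = try_parse_json(ask_model("gpt-3.5-turbo", prompt))
        if parsed is not None:
            return parsed
    # 3.5 never produced valid JSON; escalate to the smarter model
    return try_parse_json(ask_model("gpt-4", prompt))
```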
GPT-3.5 is so much cheaper ($0.002 vs $0.06 per 1K tokens), not to mention it usually returns faster and is less throttled. Given that, it makes sense to always at least attempt GPT-3.5 first.
Given that we're gonna try GPT-3.5 first, how do we determine when to fall back to GPT-4?

1) When compiling the prompt for our completion: if it leaves fewer than n tokens remaining for the completion, where n is the smallest number of tokens we expect an answer could possibly need. IMO 500 tokens is a reasonable amount to reserve, but that's a variable that could use empirical measurement.
2) When receiving the completion: we should prompt for the answer to be wrapped in delimiters, so we can detect when GPT-3.5 didn't have enough tokens to finish its attempted answer.
3) When checking whether the completion actually fixed the current error. It may be worth retrying here while slowly ratcheting up the temperature, or feeding the new error back in. We also need a way to check whether GPT is just introducing new errors that occur before the original error would have been reached.
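Rules 1 and 2 could look something like this (the 500-token reserve, the delimiter strings, and the crude 4-chars-per-token estimate are all assumptions; a real implementation would count tokens with a proper tokenizer like tiktoken):

```python
CONTEXT_WINDOW = 4096           # gpt-3.5-turbo's context size
RESERVED_FOR_COMPLETION = 500   # rule 1: minimum room to leave for the answer

def rough_token_count(text):
    # placeholder for a real tokenizer; ~4 characters per token on average
    return len(text) // 4

def fits_in_35(prompt):
    """Rule 1: does the prompt leave at least 500 tokens for the completion?"""
    return CONTEXT_WINDOW - rough_token_count(prompt) >= RESERVED_FOR_COMPLETION

# hypothetical delimiters the prompt would ask the model to wrap its answer in
BEGIN, END = "<<<ANSWER>>>", "<<<END>>>"

def extract_complete_answer(reply):
    """Rule 2: a missing closing delimiter means the model ran out of
    tokens mid-answer, so return None to signal a fallback."""
    start = reply.find(BEGIN)
    end = reply.find(END)
    if start == -1 or end == -1 or end < start:
        return None  # truncated or malformed reply
    return reply[start + len(BEGIN):end].strip()
```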
Additionally, the code should include future-proofing for fallback to the 32K model using rules 1 & 2 (since it's not smarter, just bigger), obviously disabled by a flag. Similarly, allow disabling the 8K GPT-4 model using the same system.
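That fallback chain might be sketched as a flag-gated model table (the table contents and the `pick_model` helper are illustrative, not an existing wolverine interface; `prompt_tokens` is assumed to come from a real token count):

```python
# each entry can be switched on or off by flag; 32k stays off by default
MODELS = [
    {"name": "gpt-3.5-turbo", "context": 4096,  "enabled": True},
    {"name": "gpt-4",         "context": 8192,  "enabled": True},
    {"name": "gpt-4-32k",     "context": 32768, "enabled": False},
]

def pick_model(prompt_tokens, reserve=500):
    """Return the cheapest enabled model whose window fits the prompt
    plus the reserved completion budget (rules 1 & 2 generalized)."""
    for m in MODELS:
        if m["enabled"] and m["context"] - prompt_tokens >= reserve:
            return m["name"]
    raise RuntimeError("prompt too large for every enabled model")
```

Since the 32K model isn't smarter, only bigger, the same size-based rules cover it with no special casing; enabling it is just flipping the flag.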