Closed abrichr closed 4 months ago
We can also use Gemini.
Edit: We have not attempted this since before large video context windows were available.
I feel that it’d be good to:
- have one baseline that’s almost entirely about prompting a multi-modal API to do the task. This might get a 0% success rate, but having this script will allow us to evaluate any new model that comes out. (The point is that if AGI or GPT6 happens, this script should suddenly be able to do the work.)
- Then add additional techniques like Set-of-Mark prompting (if vision is the bottleneck), which is almost like the first baseline, without explicit structure. Maybe this doesn’t work at all either.
- Then we diagnose the main bottleneck (is it vision? or planning? or RAG? or outcome determination?), and then add more structure strategically --@LunjunZhang
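
To make the first bullet concrete, here is a rough sketch of what a model-agnostic baseline harness could look like. Everything in it is an assumption for illustration: `ModelFn`, `Episode`, `success_rate`, and `stub_model` are hypothetical names, and exact string matching stands in for real outcome determination. The idea is only that any new multi-modal model can be dropped in as a callable and scored with the same script.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical interface: a "model" is any callable that maps
# (task prompt, screenshot bytes) to an action string.
ModelFn = Callable[[str, bytes], str]


@dataclass
class Episode:
    """One recorded task step (hypothetical structure)."""
    task: str
    screenshot: bytes
    expected_action: str


def success_rate(model: ModelFn, episodes: Sequence[Episode]) -> float:
    """Fraction of episodes where the model's action matches the expected one.

    Exact (case-insensitive) string matching is a placeholder assumption;
    real outcome determination would likely need something richer.
    """
    if not episodes:
        return 0.0
    hits = sum(
        model(ep.task, ep.screenshot).strip().lower()
        == ep.expected_action.strip().lower()
        for ep in episodes
    )
    return hits / len(episodes)


def stub_model(task: str, screenshot: bytes) -> str:
    """Stand-in for a real multi-modal API call; always answers 'noop'."""
    return "noop"


episodes = [
    Episode("Open the calculator", b"", "click calculator icon"),
    Episode("Do nothing", b"", "noop"),
]
print(success_rate(stub_model, episodes))
```

Swapping `stub_model` for a wrapper around a real API would leave the scoring untouched, so the same script could be rerun whenever a new model comes out.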
Feature request
Motivation