FormulaMonks / kurt

A wrapper for AI SDKs, for building LLM-agnostic structured AI applications in TypeScript
MIT License

Capability eval suite #28

Open jemc opened 4 months ago

jemc commented 4 months ago

In our ad hoc testing with KurtOpenAI and KurtVertexAI, we have seen problems like:

We want to formalize this kind of testing for any LLM provider, so we can share empirically validated findings about the relative capabilities of different providers, in the context of the features that matter to Kurt users.

I envision:

jemc commented 4 months ago

I've noticed some changes today in VertexAI behavior - I haven't tested extensively but it seems more reliable than before.

This is the kind of situation where a capability eval suite would help: I could run it to comprehensively re-test all the situations where we've found limitations before.

jemc commented 2 months ago

Another issue I found with VertexAI to add to the eval suite: it seems incapable of generating an apostrophe character inside a structured data string field - likely because they are using single-quoted strings under the hood, and the model hasn't been trained to generate an escaped apostrophe character.

Currently, as soon as it encounters an apostrophe in the text it's trying to generate in such a field, Gemini will end the string instead of continuing to generate the rest of the text.
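
As a rough illustration of how the apostrophe case could be captured as an eval, here is a minimal sketch. It deliberately does not use Kurt's actual API: `StructuredGenerator`, `EvalResult`, `checkApostropheOutput`, and `runApostropheEval` are hypothetical names made up for this example, and the `{ text: string }` result shape is an assumption. A real suite would plug a Kurt adapter in where the generator function is passed.

```typescript
// Hypothetical adapter type: anything that takes a prompt and resolves to
// structured data with a single string field. Not part of Kurt's API.
type StructuredGenerator = (prompt: string) => Promise<{ text: string }>;

// Hypothetical result shape for one eval case.
interface EvalResult {
  name: string;
  passed: boolean;
  detail: string;
}

// Sample sentence containing apostrophes that the provider must reproduce.
const APOSTROPHE_SAMPLE = "It's the dog's bone.";

// Pure check: compares what the provider returned against what was expected,
// and flags the specific failure mode where the string ends right before an
// apostrophe.
function checkApostropheOutput(expected: string, actual: string): EvalResult {
  const name = "apostrophe inside structured string field";
  if (actual === expected) {
    return { name, passed: true, detail: "exact match" };
  }
  const truncatedAtApostrophe =
    expected.startsWith(actual) && expected[actual.length] === "'";
  const detail = truncatedAtApostrophe
    ? `truncated before apostrophe at index ${actual.length}`
    : `mismatch: expected ${JSON.stringify(expected)}, got ${JSON.stringify(actual)}`;
  return { name, passed: false, detail };
}

// Runs the case against any provider adapter. The prompt wording here is an
// assumption, not something taken from the thread.
async function runApostropheEval(
  generate: StructuredGenerator,
): Promise<EvalResult> {
  const prompt = `Copy this sentence exactly into the "text" field: ${APOSTROPHE_SAMPLE}`;
  const { text } = await generate(prompt);
  return checkApostropheOutput(APOSTROPHE_SAMPLE, text);
}
```

Keeping the check separate from the provider call means the same assertion can be reused for every provider, and re-running the suite after a provider-side change only requires swapping the generator that is passed in.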