explodinggradients / ragas

Supercharge Your LLM Application Evaluations 🚀
https://docs.ragas.io
Apache License 2.0

[R-302] Support function-calling / json mode / structured generation for testset generation #1532

Open ahgraber opened 2 days ago

ahgraber commented 2 days ago

Describe the Feature
Most service APIs now support enforcing output schemas through function calling, JSON mode, or structured generation. It would be very useful to have an option that uses the service API to enforce schema constraints, rather than hoping that chat-prompt responses follow the expected format.

Why is the feature important for you?
With OpenAI, synthetic generation works flawlessly 99% of the time. With Anthropic or Llama models, I get frequent parse errors, which trigger retries and ultimately fail. This uses a lot of tokens (and therefore $). Concretely, when generating a testset of 100 questions, gpt-4o-mini uses ~660k input tokens and produces ~13k output tokens. When I attempt to generate a testset from the same knowledge graph with Anthropic Claude 3.5 Sonnet, generation fails on parse errors, but I still end up using ~850k input and ~22.5k output tokens due to the retries!
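The retry cost described above can be sketched as a minimal parse-and-retry loop (a sketch assuming Pydantic v2; `QAPair`, `generate_with_retries`, and the stubbed replies are illustrative, not Ragas internals):

```python
import json
from pydantic import BaseModel, ValidationError

class QAPair(BaseModel):
    question: str
    answer: str

def generate_with_retries(call_llm, model_cls, max_retries=2):
    """Parse free-form chat output into `model_cls`, retrying on failure.

    Every failed attempt re-sends the full prompt -- the token cost
    described above. API-enforced schemas would make the retries
    unnecessary.
    """
    last_err = None
    for _ in range(max_retries + 1):
        raw = call_llm()  # each call re-spends the (large) input prompt
        try:
            return model_cls.model_validate_json(raw)
        except ValidationError as err:
            last_err = err
    raise last_err

# Stub LLM: one malformed reply, then valid JSON (stands in for a real call).
replies = iter([
    "Sure! Here is the question...",
    json.dumps({"question": "q", "answer": "a"}),
])
pair = generate_with_retries(lambda: next(replies), QAPair)
print(pair.question)
```

With a model that emits clean JSON the loop exits on the first attempt; with a chatty model every malformed reply doubles the input-token spend, which matches the ~850k-vs-~660k figures above.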

Additional context
Given that most of the responses are parsed with Pydantic, it should be fairly trivial to turn the desired Pydantic object into a JSON schema (hint: openai provides openai.pydantic_function_tool() to convert Pydantic models into the OpenAI-compatible subset of JSON schema).
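That conversion step can be sketched with Pydantic alone (assuming Pydantic v2; `QAPair` is an illustrative model, not a Ragas class):

```python
from pydantic import BaseModel

class QAPair(BaseModel):
    question: str
    answer: str

# Pydantic emits the JSON schema directly; providers with structured-output
# support accept (a subset of) this schema as the enforced response format.
schema = QAPair.model_json_schema()
print(schema["required"])        # fields the API must produce
print(list(schema["properties"]))

# For OpenAI specifically, openai.pydantic_function_tool(QAPair) wraps this
# schema into a ready-made tool definition (requires the openai package).
```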

R-302

jjmachan commented 1 day ago

@ahgraber thanks for the suggestion - we should definitely make that the default for the services that support it

ref: https://python.langchain.com/v0.1/docs/modules/model_io/chat/structured_output/ something on top of this should work
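"Something on top of this" could look like the sketch below: any LangChain chat model exposing `with_structured_output` (e.g. `ChatOpenAI` or `ChatAnthropic` from the linked docs) would plug in. `QAPair` and `generate_testset_item` are hypothetical names for illustration, not Ragas API:

```python
from pydantic import BaseModel

class QAPair(BaseModel):
    question: str
    answer: str

def generate_testset_item(llm, prompt: str) -> QAPair:
    # LangChain chat models expose `with_structured_output`, which routes
    # through the provider's function-calling / JSON-mode support where
    # available, so the reply is schema-enforced rather than prompt-parsed.
    structured_llm = llm.with_structured_output(QAPair)
    return structured_llm.invoke(prompt)

# Usage (assuming langchain-openai is installed and OPENAI_API_KEY is set):
#   from langchain_openai import ChatOpenAI
#   item = generate_testset_item(ChatOpenAI(model="gpt-4o-mini"),
#                                "Write one QA pair about RAG evaluation.")
```

Because the function only depends on the `with_structured_output` interface, it would fall back gracefully per provider: LangChain picks function calling, JSON mode, or tool use depending on what the underlying API supports.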

Also, I'd love to chat with you sometime, Alex, and get more feedback. I've sent you an email to connect. Are you on Discord, btw?

cheers ❤️ Jithin