datadreamer-dev / DataDreamer

DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models.   🤖💤
https://datadreamer.dev
MIT License

Add support for "guidance" and "outlines" #6

Closed amir-in-a-cynch closed 5 months ago

amir-in-a-cynch commented 7 months ago

I find guidance (and similar libraries like outlines) useful in my own dataset generation, mainly to add certain constraints to outputs. Can we add support for these in DataDreamer? I'd be happy to contribute the code and a few examples myself if there's interest from the DataDreamer maintainers.

AjayP13 commented 7 months ago

Hey, that's a good idea, and I have some interest in adding it to DataDreamer. The problem with these methods, though, is that they don't have a very standardized API that would fit in alongside all the other LLM providers we have in the library, and I don't want to overcomplicate the API interface by making this one a special case. My gut feeling is that it's better left as something a user might need to integrate as a one-off than as a core part of this library, at least until there is a standardized approach that all LLM providers get behind.

For your own work you might find this page helpful to extend DataDreamer in your own project even if it doesn't get officially added: https://datadreamer.dev/docs/latest/pages/advanced_usage/creating_a_new_datadreamer_.../llm.html

I'm happy to look at any PRs / examples you come up with, though, and consider them, if you feel you've put together something nice that others can use as part of your work!

amir-in-a-cynch commented 7 months ago

Thanks for the feedback, Ajay. Your plan makes sense: a user puts together their own one-off approach for their needs and then shares it to see if it's more broadly useful (with no expectations if it isn't).

I basically need this because I use LLMs to fill in fake forms, so valid synthetic data is tightly constrained by the form structure / grammar.
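
To make that concrete, here's a rough sketch of the kind of structure I mean (the form and field names are purely illustrative, and I'm assuming Pydantic v2 for `model_json_schema()`):

```python
# Illustrative only: a fake "form" whose structure constrains what counts as
# valid synthetic data. Assumes Pydantic v2.
from pydantic import BaseModel


class ClaimForm(BaseModel):
    claimant_name: str
    policy_number: str   # e.g. should match a pattern like "POL-123456"
    incident_date: str   # ISO 8601 date string
    amount_claimed: float


# Libraries like guidance and outlines can constrain decoding so that the
# model's output always parses into a schema like this, rather than free text.
print(ClaimForm.model_json_schema())
```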

> My gut feeling is that it's better left as something a user might need to integrate as a one-off than as a core part of this library, at least until there is a standardized approach that all LLM providers get behind.

I can put together a pretty standardized template for most open-source LLM providers. Probably something like an added optional argument, e.g. `grammar_constraints: dict = None`, where the user can pretend it doesn't exist unless they really want it.

One can't do it for OpenAI/Anthropic-style API models, and I don't expect they'll ever support that, because most API models don't expose the raw logits or allow constraining the generation. Does your standardized approach mean that users should ideally be able to constrain generation for external API and local LLMs alike, or would local LLMs alone provide enough value?
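
For local models the hook point does exist, to be clear. Here's a toy sketch of logit-level constraining with a plain Hugging Face model, just to show the mechanism; the model choice and the digits-only constraint are illustrative and not tied to DataDreamer:

```python
# Toy sketch: restrict a local model's generation to an allowed token set by
# masking logits at every decoding step. Hosted APIs don't expose this hook.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
)


class AllowedTokensOnly(LogitsProcessor):
    """Mask out every token except an allowed set at each generation step."""

    def __init__(self, allowed_token_ids):
        self.allowed = torch.tensor(sorted(allowed_token_ids))

    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed] = 0.0  # allowed tokens keep their scores
        return scores + mask


tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Digits-only generation as a stand-in for a real form grammar.
digit_ids = [i for tok, i in tokenizer.get_vocab().items() if tok.isdigit()]
inputs = tokenizer("The policy number on the form is ", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=8,
    logits_processor=LogitsProcessorList([AllowedTokensOnly(digit_ids)]),
)
print(tokenizer.decode(out[0]))
```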

AjayP13 commented 7 months ago

Thanks! If you can do it like that, with a single `grammar_constraints` parameter on the `.run()` / `._run_batch()` methods of the `LLM` class, that would be great and something I'd be very much open to merging, even if it only worked for local models. That seems like a clean enough API design.
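
Just so we're talking about the same shape, here's a rough sketch; only `grammar_constraints`, `.run()`, and `._run_batch()` are the pieces from this thread, while the class name, other parameters, and helper method are hypothetical stand-ins rather than DataDreamer's actual API:

```python
# Hypothetical sketch of the proposed single optional parameter.
from typing import Any, Optional


class SketchLocalLLM:
    def run(
        self,
        prompts: list[str],
        grammar_constraints: Optional[dict[str, Any]] = None,  # proposed addition
        **kwargs: Any,
    ) -> list[str]:
        # Defaulting to None keeps every existing caller unchanged.
        return self._run_batch(prompts, grammar_constraints=grammar_constraints, **kwargs)

    def _run_batch(
        self,
        prompts: list[str],
        grammar_constraints: Optional[dict[str, Any]] = None,
        **kwargs: Any,
    ) -> list[str]:
        if grammar_constraints is None:
            return self._generate(prompts, **kwargs)  # existing, unconstrained path
        # Constrained path: only meaningful for local backends, e.g. by
        # compiling the grammar into a logits processor before generating.
        return self._generate(prompts, grammar_constraints=grammar_constraints, **kwargs)

    def _generate(self, prompts, grammar_constraints=None, **kwargs):
        # Stand-in for the real backend call (hypothetical helper).
        raise NotImplementedError
```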

OpenAI doesn't support general grammars, but they do actually support JSON output via `response_format={ "type": "json_object" }`, which we already support. Maybe we can roll grammar constraints into that format with something like `response_format={ "type": "grammar", "grammar_constraints": {...} }`.
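
For reference, what's supported today with the OpenAI client looks like the snippet below (model name illustrative); the "grammar" variant at the end is only the hypothetical shape I'm floating, not something OpenAI actually offers:

```python
# JSON mode as OpenAI supports it today (openai v1 Python client).
# Note: the prompt has to mention JSON for json_object mode to be accepted.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",  # any model with JSON mode support
    messages=[{"role": "user", "content": "Return a JSON object with a 'name' field."}],
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)

# Hypothetical extension floated above (NOT a real OpenAI option):
# response_format={"type": "grammar", "grammar_constraints": {...}}
```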