Generalize the `chat_template` prompt strategy with more configuration options

fozziethebeat commented 5 months ago

⚠️ Please check that this feature request hasn't been suggested before.

[X] I searched previous Ideas in Discussions and didn't find any similar feature requests.
[X] I searched previous Issues and didn't find any similar feature requests.

🔖 Feature description

Right now the chat_template strategy requires data to be in the conversations field, using what I assume is the sharegpt data format with from and value keys and a specific set of roles (human, user, assistant, and gpt), as the unit test makes clear.

It would be nice if we could specify chat_template and then configure all of these fields, with something like:

data_field (defaults to conversations)
role_map (defaults to the roles listed above)
role_field (defaults to from)
value_field (defaults to value)

Ideally a user could configure a dataset with a format having a conversation field where the entries look like:

[
  { 'role': 'My user', 'content': 'something' },
  { 'role': 'My assistant', 'content': 'response' },
  ....
]

And then be able to dynamically map it appropriately so that it works with tokenizer.apply_chat_template.
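To make the request concrete, here is a rough sketch of the mapping step, not the actual axolotl implementation. The function name normalize_turns and the default role map are hypothetical; the parameter names simply reuse the options proposed above. The output is a list of dicts with role and content keys, which is the message shape tokenizer.apply_chat_template accepts.

```python
# Hypothetical default: sharegpt-style role names mapped to chat-template roles.
DEFAULT_ROLE_MAP = {
    "human": "user",
    "user": "user",
    "gpt": "assistant",
    "assistant": "assistant",
}


def normalize_turns(
    example,
    data_field="conversations",
    role_field="from",
    value_field="value",
    role_map=None,
):
    """Convert one dataset row into a list of {"role", "content"} messages."""
    role_map = DEFAULT_ROLE_MAP if role_map is None else role_map
    messages = []
    for turn in example[data_field]:
        source_role = turn[role_field]
        if source_role not in role_map:
            # Fail loudly instead of silently dropping an unknown role.
            raise ValueError(f"unmapped role: {source_role!r}")
        messages.append(
            {"role": role_map[source_role], "content": turn[value_field]}
        )
    return messages


# The custom format from the example above.
example = {
    "conversation": [
        {"role": "My user", "content": "something"},
        {"role": "My assistant", "content": "response"},
    ]
}

messages = normalize_turns(
    example,
    data_field="conversation",
    role_field="role",
    value_field="content",
    role_map={"My user": "user", "My assistant": "assistant"},
)
```

The resulting messages list could then be passed to tokenizer.apply_chat_template as-is.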

✔️ Solution

I think it involves copying some of the configs specific to the sharegpt prompt strategy and forwarding them into the get_conversation_thread converter function.

With some suggestions on desired configuration options and fields, I'm happy to implement and test this.

❓ Alternatives

The alternative, of course, is to require users to reformat their data to sharegpt, after which they can use roughly the equivalent configuration.

📝 Additional Context

No response

Acknowledgements

[X] My issue title is concise, descriptive, and in title casing.
[X] I have searched the existing issues to make sure this feature has not been requested yet.
[X] I have provided enough information for the maintainers to understand and evaluate this request.

magbyr commented 5 months ago

I second this. We have an OpenAI dataset in JSONL of the form { "messages": [{ "role": "user", "content": "..."}, { "role": "assistant", "content": "..."}]}. We are migrating to training open models, and I had some problems using this dataset with axolotl (might be related to #1649 as well). It would probably be beneficial if such a migration path is as simple as possible.
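For reference, a minimal sketch of reading that JSONL shape, assuming one JSON object per line with a messages list. The helper name read_openai_jsonl is hypothetical and not part of axolotl. Since the turns already use role and content keys, no role remapping is needed here.

```python
import io
import json


def read_openai_jsonl(stream, data_field="messages"):
    """Read OpenAI-style JSONL and return one list of turns per line."""
    conversations = []
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        turns = record[data_field]
        for turn in turns:
            if "role" not in turn or "content" not in turn:
                raise ValueError(
                    f"line {line_number}: turn is missing 'role' or 'content'"
                )
        conversations.append(turns)
    return conversations


# In-memory stand-in for a .jsonl file.
sample = io.StringIO(
    '{"messages": [{"role": "user", "content": "Hi"}, '
    '{"role": "assistant", "content": "Hello"}]}\n'
)
conversations = read_openai_jsonl(sample)
```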

fozziethebeat commented 5 months ago

Glad this has support! The OpenAI format seems like a standard format that's worth supporting natively.

mlmonk commented 5 months ago

+1 This will be quite useful.

wayne-wang-1119 commented 5 months ago

+1 urgent problem

fozziethebeat commented 5 months ago

Feel free to double-check the PR and make sure I covered the core use case properly.
