axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0
7.87k stars, 866 forks

Generalize the `chat_template` prompt strategy with more configuration options #1654

Closed: fozziethebeat closed this issue 5 months ago

fozziethebeat commented 5 months ago


🔖 Feature description

Right now the chat_template strategy requires data to be in the conversations field, using what I assume is the sharegpt data format: from and value keys and a fixed set of roles (human, user, assistant, and gpt), as the unit test makes clear.

It would be nice if we could specify chat_template and then configure all of these fields.

Ideally a user could configure a dataset with a format having a conversation field where the entries look like:

[
  { 'role': 'My user', 'content': 'something' },
  { 'role': 'My assistant', 'content': 'response' },
  ....
]

And then be able to dynamically map it appropriately so that it works with tokenizer.apply_chat_template.
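To make the idea concrete, here is a rough sketch of the kind of mapping I have in mind. Everything in it is hypothetical (the `normalize_messages` helper, the `ROLE_MAP` table, and the `role_key` / `content_key` parameter names are made up for illustration and are not existing axolotl options). It only shows how custom keys and roles could be normalized into the `role` / `content` messages that chat templates typically expect:

```python
from typing import Dict, List

# Hypothetical mapping from dataset-specific role names to the role names
# a chat template usually expects. Not an existing axolotl setting.
ROLE_MAP: Dict[str, str] = {"My user": "user", "My assistant": "assistant"}


def normalize_messages(
    turns: List[Dict[str, str]],
    role_key: str = "role",
    content_key: str = "content",
    role_map: Dict[str, str] = ROLE_MAP,
) -> List[Dict[str, str]]:
    """Rename keys and roles so each turn looks like {'role': ..., 'content': ...}."""
    normalized = []
    for turn in turns:
        raw_role = turn[role_key]
        if raw_role not in role_map:
            # Fail loudly instead of silently passing an unknown role through.
            raise ValueError(f"Unmapped role: {raw_role!r}")
        normalized.append({"role": role_map[raw_role], "content": turn[content_key]})
    return normalized


conversation = [
    {"role": "My user", "content": "something"},
    {"role": "My assistant", "content": "response"},
]
messages = normalize_messages(conversation)
# `messages` is now a list of role/content dicts, which is the shape that a
# Hugging Face tokenizer's apply_chat_template takes as its conversation input.
```

The same helper would cover sharegpt-style data by passing `role_key="from"` and `content_key="value"` with a matching role map.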

✔️ Solution

I think it involves copying some of the configs specific to the sharegpt prompt strategy and forwarding them into the get_conversation_thread converter function.

With some suggestions on desired configuration options and fields, I'm happy to implement and test this.

❓ Alternatives

The alternative, of course, is to require users to reformat their data to sharegpt, after which they can do roughly the equivalent configuration.
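For reference, that manual conversion might look roughly like the sketch below. The `to_sharegpt` helper is hypothetical, and the role mapping (user to human, assistant to gpt) is my assumption based on the roles listed above:

```python
from typing import Dict, List

# Assumed mapping from role/content style roles to sharegpt style roles.
ROLE_TO_FROM: Dict[str, str] = {"user": "human", "assistant": "gpt"}


def to_sharegpt(messages: List[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Convert role/content messages into a record with a conversations field."""
    return {
        "conversations": [
            {"from": ROLE_TO_FROM[message["role"]], "value": message["content"]}
            for message in messages
        ]
    }


record = to_sharegpt(
    [
        {"role": "user", "content": "something"},
        {"role": "assistant", "content": "response"},
    ]
)
```

It works, but every user with a differently shaped dataset has to write and maintain a script like this before training.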

📝 Additional Context

No response


magbyr commented 5 months ago

I second this. We have an OpenAI dataset in JSONL of the form { "messages": [{ "role": "user", "content": "..."}, { "role": "assistant", "content": "..."}]}. We are migrating to training open models, and I had some problems using this dataset with axolotl (might be related to #1649 as well). It would be beneficial if this migration path were as simple as possible.
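For anyone who wants to sanity-check a file in this shape before training, a minimal sketch follows. The `read_openai_jsonl` helper is hypothetical and only uses the standard library; it assumes one JSON object per line with a `messages` list as described above:

```python
import json
from typing import Dict, Iterable, List


def read_openai_jsonl(lines: Iterable[str]) -> List[List[Dict[str, str]]]:
    """Parse JSONL lines and return the messages list of each record."""
    conversations = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        messages = json.loads(line)["messages"]
        for message in messages:
            if "role" not in message or "content" not in message:
                raise ValueError(f"line {line_number}: message needs role and content")
        conversations.append(messages)
    return conversations


sample_lines = [
    '{"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]}',
    "",
]
conversations = read_openai_jsonl(sample_lines)
```

With a real file, the lines could come from `open(path, encoding="utf-8")` instead of the in-memory list.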

fozziethebeat commented 5 months ago

Glad this has support! The OpenAI format seems like a standard format that's worth supporting natively.

mlmonk commented 5 months ago

+1 This will be quite useful.

wayne-wang-1119 commented 5 months ago

+1 urgent problem

fozziethebeat commented 5 months ago

Feel free to double-check the PR and make sure I covered the core use case properly.