Closed fozziethebeat closed 5 months ago
I second this. We have an OpenAI dataset in JSONL of the form { "messages": [{ "role": "user", "content": "..."}, { "role": "assistant", "content": "..."}]}. We are migrating to training open models, and I had some problems using this dataset with axolotl (might be related to #1649 as well). It would probably be beneficial if such a migration path is as simple as possible.
Glad this has support! The OpenAI format seems like a standard format that's worth supporting natively.
+1 This will be quite useful.
+1 urgent problem
Feel free to double check the PR and make sure I covered the core use case properly
⚠️ Please check that this feature request hasn't been suggested before.
🔖 Feature description
Right now the `chat_template` strategy requires data to be in the `conversations` field using what I assume is the `sharegpt` data format with `from` and `value` keys and a specific set of roles (`human`, `user`, `assistant`, and `gpt`), as made clear from the unittest.

It would be nice if we could specify `chat_template` and then configure all these fields with something like:

- `data_field` (defaults to `conversations`)
- `role_map` (defaults to above)
- `role_field` (defaults to `from`)
- `value_field` (defaults to `value`)

Ideally a user could configure a dataset with a format having a `conversation` field where the entries look like:

And then be able to dynamically map it appropriately so that it works with `tokenizer.apply_chat_template`.
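To make the request concrete, here is a minimal sketch of the mapping I have in mind. It is not axolotl code: `normalize_conversation` is a hypothetical helper, the parameter names are just the options proposed above, and the default role map (`human`/`user` to `user`, `gpt`/`assistant` to `assistant`) is my assumption about how the current roles are interpreted.

```python
from typing import Any, Dict, List, Optional

# Assumed default mapping from the sharegpt-style roles listed above to the
# role names that chat templates typically expect. Not taken from axolotl.
DEFAULT_ROLE_MAP: Dict[str, str] = {
    "human": "user",
    "user": "user",
    "gpt": "assistant",
    "assistant": "assistant",
}


def normalize_conversation(
    example: Dict[str, Any],
    data_field: str = "conversations",
    role_field: str = "from",
    value_field: str = "value",
    role_map: Optional[Dict[str, str]] = None,
) -> List[Dict[str, str]]:
    """Convert one dataset row into a list of {"role", "content"} messages.

    Hypothetical helper: the four keyword arguments correspond to the
    configuration options proposed in this issue.
    """
    role_map = DEFAULT_ROLE_MAP if role_map is None else role_map
    messages = []
    for turn in example[data_field]:
        source_role = turn[role_field]
        if source_role not in role_map:
            # Fail loudly instead of silently dropping or mislabeling a turn.
            raise ValueError(f"Unmapped role: {source_role!r}")
        messages.append(
            {"role": role_map[source_role], "content": turn[value_field]}
        )
    return messages


# sharegpt-style row, using the defaults.
sharegpt_row = {
    "conversations": [
        {"from": "human", "value": "Hi"},
        {"from": "gpt", "value": "Hello"},
    ]
}
sharegpt_messages = normalize_conversation(sharegpt_row)

# OpenAI-style row, as described in the comments on this issue.
openai_row = {
    "messages": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"},
    ]
}
openai_messages = normalize_conversation(
    openai_row, data_field="messages", role_field="role", value_field="content"
)
```

Both calls produce the same list of `role`/`content` dictionaries, which is the shape that `tokenizer.apply_chat_template` takes as its conversation argument. A dataset with a `system` role would need an explicit `role_map` entry under this sketch.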
✔️ Solution
I think it involves copying some of the configs specific to the `sharegpt` prompt strategy and forwarding them into the `get_conversation_thread` converter function.

With some suggestions on desired configuration options and fields, I'm happy to implement and test this.
❓ Alternatives
The alternative, of course, is to require users to reformat their data to `sharegpt`, after which they can do roughly the equivalent configuration.
📝 Additional Context
No response