chiragjn opened 1 month ago
On a separate note, I would also like to explore how we can support passing a Jinja template entirely via the config itself, which might be more flexible
Thanks! Would you have some time to put together some tests for this?
Added a few tests
I wonder if it is a good idea to use the chat_template available from the trained model's tokenizer_config.json as the default template?
Yes, that is what this PR aims to do. If we all agree, we can change the default in the config to tokenizer_default_fallback_chatml
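To spell out what I'd expect tokenizer_default_fallback_chatml to mean (just an illustrative sketch, not the actual axolotl code): use whatever template the tokenizer ships with, and only fall back to chatml when it has none.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mlx-community/gemma-2-2b")  # example model only
chatml_template = "..."  # stand-in for the chatml Jinja string axolotl already bundles
# prefer the template stored in tokenizer_config.json; fall back to chatml if it is missing
template = tokenizer.chat_template or chatml_template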
:)
Yes. I am sorry I just read the PR in more detail and found that's exactly what I was hoping for (what a coincidence!) so I removed my previous dumb question. Thanks for the contribution!
I have one question. Can this option be used together with the sharegpt conversation format?
Unfortunately no, they are different templating libraries.
Thank you for working on this! I was needing just this...and was going to either integrate my custom chat template into axolotl (limits me on experimenting) or work on getting this feature going. Glad you beat me to it. Doesn't look like it'll conflict much with the changes I made for https://github.com/axolotl-ai-cloud/axolotl/pull/1756.
Amazing, would love to see both getting merged
Added chat_template_jinja option to provide a Jinja template as per @fozziethebeat's suggestion
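For context on how a raw Jinja string can work at all: HF tokenizers accept an arbitrary template via the chat_template argument of apply_chat_template, which, as far as I can tell, is what this option relies on. A minimal illustration (model id and template are just examples):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mlx-community/gemma-2-2b")  # example model only
custom_template = (
    "{% for message in messages %}"
    "{{ message['role'] + ': ' + message['content'] + '\n' }}"
    "{% endfor %}"
)
rendered = tokenizer.apply_chat_template(
    [{"role": "user", "content": "hello"}],
    chat_template=custom_template,
    tokenize=False,
)
print(rendered)  # user: hello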
Bumping for a review here @winglian
Regarding point 4, it seems that the Jinja template from that repo is very different compared to how FastChat/Axolotl currently formats the prompt, and also compared to Mistral's templates:
- FastChat/Axolotl currently places the system message with a single newline inside the first user turn
- Mistral 0.1 places the system message with 2 newlines inside the first user turn, and with different spacing between [INST] tokens
- The Mistral v3 tokenizer places the system message with 2 newlines inside the last user turn
The one in the linked repo puts the system prompt before the first [INST]; I'm unsure if this is an oversight by the creator of that template/repo or deliberately chosen to be different from the official Mistral template.
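For anyone who wants to double check, the placement is easy to inspect by rendering whatever template a tokenizer ships with a system turn included (the model id below is just an example and may be gated):

from transformers import AutoTokenizer

messages = [
    {"role": "system", "content": "SYSTEM"},
    {"role": "user", "content": "USER"},
    {"role": "assistant", "content": "ASSISTANT"},
]
# render the tokenizer's own template and see where SYSTEM lands relative to [INST]
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
print(tok.apply_chat_template(messages, tokenize=False))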
I can get rid of the Mistral template now; this PR allows passing a custom Jinja template anyway, and I had only added it for convenience.
deliberately chosen to be different from the official mistral template
Most likely deliberate, as older Mistral models did not support a system message, so there was no official way for older models.
Bumping this again @winglian
Please let me know if there's any part you need help with. I think as a summary, the only parts left to be addressed are:
- config.qmd to list all these new configs
- conversation.qmd to show a few examples of this and explain the new options

Have addressed comments and updated docs to the best of my knowledge. Will test DPO/ORPO.
One thing I am not clear on: what should the behavior be if someone provides different chat templates across different datasets? It causes ambiguity in selecting which template to finally save in the tokenizer. Ideally, no one should do it this way, but there is no validation to protect against this.
That's a good point. I missed this.
In this case, we may need to adjust the hash to include the chat_template (if it exists?). https://github.com/axolotl-ai-cloud/axolotl/blob/17af1d7081414c32614cbabe324e1197ca9f43a7/src/axolotl/utils/data/sft.py#L143
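Something along these lines could work (purely a sketch with made-up names, not the actual code at that link), so that two runs differing only in chat_template do not share a prepared-dataset cache:

import hashlib

def dataset_prep_hash(dataset_cfg: dict, chat_template: str = "") -> str:
    # hypothetical helper: fold the resolved chat template (when set) into the
    # fingerprint used for the prepared-dataset cache, alongside the dataset config
    payload = repr(sorted(dataset_cfg.items())) + chat_template
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

print(dataset_prep_hash({"path": "LDJnr/Puffin", "type": "chat_template"}, "chatml"))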
Hi @chiragjn,
I'd like to point out a potential issue that might occur when training DPO with the chat_template.default prompt strategy and using the base model's (or custom) chat template.
In many open source models (e.g. Phi-3-medium-128k-instruct), the chat template validates that the first message has the user role, and raises an exception otherwise.
Consider the code here:
result["chosen"] = tokenizer.apply_chat_template(
[chosen],
add_generation_prompt=False,
chat_template=chat_template_string,
tokenize=False,
)
We pass [chosen] here, which is a list containing a single message with the assistant role, so the validation mentioned above will raise an exception.
This can be solved by adding a dummy user message, e.g.:
dummy_user_message = {"role": "user", "content": "dummy"}
result["chosen"] = tokenizer.apply_chat_template(
    [dummy_user_message, chosen],
    add_generation_prompt=False,
    chat_template=chat_template_string,
    tokenize=False,
)
For trimming the dummy message, no additional code is needed, as it will be trimmed by the lines that follow:
chosen_strip_index = result["chosen"].find(chosen["content"])
result["chosen"] = result["chosen"][chosen_strip_index:].rstrip()
These lines remove what comes before the actual message content (typically an assistant header / special token).
The same workaround is needed in the rejected case as well (here).
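For completeness, the analogous change on the rejected side would look like this (mirroring the snippet above, and assuming the rejected case has the same trimming lines as the chosen case):

dummy_user_message = {"role": "user", "content": "dummy"}
result["rejected"] = tokenizer.apply_chat_template(
    [dummy_user_message, rejected],
    add_generation_prompt=False,
    chat_template=chat_template_string,
    tokenize=False,
)
# the dummy turn sits before the assistant header, so the existing trimming removes it
rejected_strip_index = result["rejected"].find(rejected["content"])
result["rejected"] = result["rejected"][rejected_strip_index:].rstrip()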
Hello @levguy, thank you for the find. Would it be better to separate that into a different PR to limit the scope of this PR? I think we would want to get this PR in first.
@chiragjn, may I ask if there are any parts blocking this current PR now? I believe you have addressed the points I raised. Only the hash of the tokenized dataset remains.
I also notice that the loss masking may have some issues for chat template (likely not due to this PR, but an old issue).
For example, it sometimes does not detect the assistant portion correctly.
# user masked correctly
<(-100, 235322) |(-100, 235371) im(-100, 571) _(-100, 235298) start(-100, 2997) |>(-100, 73786) user(-100, 1645)
(-100, 108) Each(-100, 7639) problem(-100, 3210) consists(-100, 13569) of(-100, 576) three(-100, 2149)
...
# assistant also masked
<(-100, 235322) |(-100, 235371) im(-100, 571) _(-100, 235298) start(-100, 2997) |>(-100, 73786) assistant(-100, 105776)
(-100, 108) If(-100, 2495) the(-100, 573) third(-100, 4906) statement(-100, 6218) is(-100, 603) true(-100, 1382) ((-100, 591) Bananas(-100, 217061) cost(-100, 2458)
Config to recreate:
base_model: mlx-community/gemma-2-2b
datasets:
  - path: LDJnr/Puffin
    type: chat_template
    chat_template: chatml
    shards: 10
chat_template: chatml
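For reference, (label, token_id) pairs like the ones above can be reproduced with a rough helper like the one below (names are assumptions, not axolotl code):

# rough helper to eyeball which tokens are masked (-100) vs. trained on
def dump_labels(tokenizer, sample):
    for label, token_id in zip(sample["labels"], sample["input_ids"]):
        print(f"{tokenizer.decode([token_id])}({label}, {token_id})", end=" ")
    print()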
Sorry, I did not mean to block this PR. It is just that I have been busy at my main job and wanted to address the DPO issues before merging. I'll make some time for this.
Closes #1689
Summary of changes:
- tokenizer_default as an option for chat_template in the chat_template prompt strategy, which allows using the chat template from the tokenizer's tokenizer_config.json
- chat_template_jinja option, which is only considered when the chat_template option is set to jinja
Why?
Many popular models are not trained with the chatml format. As a result, for the model to correctly learn chatml we have to turn on train_on_inputs, which requires more compute and time. If we can use the model's already-learned chat template, we only need to learn the output tokens.