axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

Allow using tokenizer's default chat template or pass custom jinja chat template #1732

Open · chiragjn opened 1 month ago

chiragjn commented 1 month ago

Closes #1689

Summary of changes:

  1. Adds tokenizer_default as an option for chat_template in the chat_template prompt strategy, allowing use of the chat template from the tokenizer's tokenizer_config.json (see the sketch after this list)
  2. Allows falling back to the chat templates available in axolotl if the tokenizer does not have a chat template
  3. Allows passing in a custom chat template via a new chat_template_jinja option, which is only considered when the chat_template option is set to jinja
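
Roughly, in terms of the transformers tokenizer API, the three settings correspond to something like the sketch below (illustration only, not axolotl's implementation; the model id and the simplified chatml template are placeholders):

    from transformers import AutoTokenizer

    # Simplified chatml template, standing in for axolotl's built-in templates
    CHATML_TEMPLATE = (
        "{% for message in messages %}"
        "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}"
        "{% endfor %}"
    )

    tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")  # placeholder model id
    messages = [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"},
    ]

    # 1. chat_template: tokenizer_default -> use the template shipped in tokenizer_config.json
    text = tokenizer.apply_chat_template(messages, tokenize=False)

    # 2. chat_template: tokenizer_default_fallback_chatml -> fall back to a known template
    #    when the tokenizer does not define one
    template = tokenizer.chat_template or CHATML_TEMPLATE
    text = tokenizer.apply_chat_template(messages, chat_template=template, tokenize=False)

    # 3. chat_template: jinja -> pass the string supplied via the chat_template_jinja config key
    text = tokenizer.apply_chat_template(messages, chat_template=CHATML_TEMPLATE, tokenize=False)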

Why?

Many popular models are not trained with the chatml format. As a result, for the model to correctly learn chatml we have to turn on train_on_inputs, which requires more compute and time. If we can use the chat template the model has already learned, we can train on just the output tokens.
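
To make that concrete, here is a minimal sketch (not axolotl code) of training only on the assistant's reply: prompt tokens get label -100 so they do not contribute to the loss. The model id is a placeholder.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")  # placeholder model id

    user_turn = [{"role": "user", "content": "What is 2 + 2?"}]
    full_conv = user_turn + [{"role": "assistant", "content": "4"}]

    # Token ids for the prompt (user turn plus assistant header) and the full conversation
    prompt_ids = tokenizer.apply_chat_template(user_turn, add_generation_prompt=True)
    full_ids = tokenizer.apply_chat_template(full_conv)

    # Assuming the prompt ids are a prefix of the full sequence (true for most templates),
    # mask them out so the loss covers only the assistant tokens
    labels = [-100] * len(prompt_ids) + full_ids[len(prompt_ids):]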


chiragjn commented 1 month ago

On a separate note, I would also like to explore how we can support passing a Jinja template entirely via the config itself, which might be more flexible.

winglian commented 1 month ago

Thanks! Would you have some time to put together some tests for this?

chiragjn commented 1 month ago

Added a few tests

I wonder if it is a good idea to use the chat_template available from the trained model's tokenizer_config.json as the default template?

Yes, that is what this PR aims to do. If we all agree, we can change the default in the config to tokenizer_default_fallback_chatml :)

ganler commented 1 month ago

Yes. I am sorry I just read the PR in more detail and found that's exactly what I was hoping for (what a coincidence!) so I removed my previous dumb question. Thanks for the contribution!

ganler commented 1 month ago

I have one question. Can this option be used together with the sharegpt conversation format?

winglian commented 1 month ago

I have one question. Can this option be used together with the sharegpt conversation format?

Unfortunately no, they are different templating libraries.

Tostino commented 1 month ago

Thank you for working on this! I was needing just this...and was going to either integrate my custom chat template into axolotl (limits me on experimenting) or work on getting this feature going. Glad you beat me to it. Doesn't look like it'll conflict much with the changes I made for https://github.com/axolotl-ai-cloud/axolotl/pull/1756.

chiragjn commented 1 month ago

Thank you for working on this! I was needing just this...and was going to either integrate my custom chat template into axolotl (limits me on experimenting) or work on getting this feature going. Glad you beat me to it. Doesn't look like it'll conflict much with the changes I made for https://github.com/axolotl-ai-cloud/axolotl/pull/1756.

Amazing, would love to see both getting merged

chiragjn commented 1 month ago

Added chat_template_jinja option to provide a Jinja template as per @fozziethebeat's suggestion

chiragjn commented 1 month ago

Bumping for a review here @winglian

nyxkrage commented 2 weeks ago

Regarding point 4, it seems that the Jinja template from that repo is very different from how fastchat/axolotl currently formats the prompt, and from Mistral's own templates.

FastChat/Axolotl currently places the system message, followed by a single newline, inside the first user turn. Mistral v0.1 places the system message, followed by two newlines, inside the first user turn, and uses different spacing around the [INST] tokens. The Mistral v3 tokenizer places the system message, followed by two newlines, inside the last user turn.

The template in the linked repo puts the system prompt before the first [INST]. I'm unsure whether this is an oversight by the creator of that template/repo or a deliberate departure from the official Mistral template.

chiragjn commented 2 weeks ago

I can get rid of the Mistral template now; this PR allows passing a custom Jinja template anyway, and I had only added it for convenience.

deliberately chosen to be different from the official mistral template

Most likely deliberate, as older Mistral models did not support a system message, so there was no official way to include one for those models.

chiragjn commented 2 weeks ago

Bumping this again @winglian

NanoCode012 commented 1 week ago

Please let me know if there's any part you need help with. I think as a summary, the only parts left to be addressed are:

chiragjn commented 1 week ago

Have addressed comments and updated docs to the best of my knowledge. Will test DPO/ORPO.

One thing I am not clear on: what should the behavior be if someone provides different chat templates across different datasets? It causes ambiguity in selecting which template to finally save with the tokenizer. Ideally, no one should do it this way, but there is no validation to protect against this.

NanoCode012 commented 1 week ago

One thing I am not clear on: what should the behavior be if someone provides different chat templates across different datasets? It causes ambiguity in selecting which template to finally save with the tokenizer. Ideally, no one should do it this way, but there is no validation to protect against this.

That's a good point. I missed this.

In this case, we may need to adjust the hash to include the chat_template (if it exists). https://github.com/axolotl-ai-cloud/axolotl/blob/17af1d7081414c32614cbabe324e1197ca9f43a7/src/axolotl/utils/data/sft.py#L143
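
Something along these lines, perhaps (a sketch only, with hypothetical names, not the actual code at that link): mix the resolved chat template into whatever string is currently hashed for the prepared-dataset cache path, so changing the template invalidates the cached tokenized dataset.

    import hashlib

    def prepared_ds_hash(config_repr: str, chat_template: str | None = None) -> str:
        # Hypothetical helper: append the resolved chat template (if any) to the
        # hashed payload so a different template yields a different cache entry
        payload = config_repr + (chat_template or "")
        return hashlib.md5(payload.encode("utf-8")).hexdigest()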

levguy commented 1 week ago

Hi @chiragjn, I'd like to point out a potential issue that might occur when training DPO with the chat_template.default prompt strategy and using the base model's (or a custom) chat template. In many open source models (e.g. Phi-3-medium-128k-instruct), the chat template validates that the first message has the user role, and raises an exception otherwise. Consider the code here:

        result["chosen"] = tokenizer.apply_chat_template(
            [chosen],
            add_generation_prompt=False,
            chat_template=chat_template_string,
            tokenize=False,
        )

Here we pass [chosen], a list containing a single message with the assistant role, so the validation mentioned above raises an exception.

This can be solved by adding a dummy user message, e.g.:

        # Prepend a throwaway user turn so templates that require the first
        # message to be from the user do not raise
        dummy_user_message = {"role": "user", "content": "dummy"}
        result["chosen"] = tokenizer.apply_chat_template(
            [dummy_user_message, chosen],
            add_generation_prompt=False,
            chat_template=chat_template_string,
            tokenize=False,
        )

For trimming the dummy message, no additional code is needed, as it will be trimmed by the lines that follow:

        chosen_strip_index = result["chosen"].find(chosen["content"])
        result["chosen"] = result["chosen"][chosen_strip_index:].rstrip()

These lines remove what comes before the actual message content (typically an assistant header / special token).

The same workaround is needed in the rejected case as well (here).
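
Pulling the pieces together, a sketch of how the same workaround might look for both fields (variable names follow the snippets above; this is not the exact code in the strategy):

    def apply_with_dummy_user(tokenizer, message, chat_template_string):
        # Prepend a throwaway user turn so templates that insist the first message
        # is from the user (e.g. Phi-3 style) do not raise; everything before the
        # real content is stripped afterwards, so the dummy text never survives
        dummy_user_message = {"role": "user", "content": "dummy"}
        text = tokenizer.apply_chat_template(
            [dummy_user_message, message],
            add_generation_prompt=False,
            chat_template=chat_template_string,
            tokenize=False,
        )
        strip_index = text.find(message["content"])
        return text[strip_index:].rstrip()

    # Inside the transform, where tokenizer, chosen, rejected, and result are in scope:
    result["chosen"] = apply_with_dummy_user(tokenizer, chosen, chat_template_string)
    result["rejected"] = apply_with_dummy_user(tokenizer, rejected, chat_template_string)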

NanoCode012 commented 1 day ago

Hello @levguy, thank you for the find. Would it be better to separate that into a different PR to limit the scope of this one? I think we would want to get this PR in first.

@chiragjn, may I ask if there are any parts blocking this PR now? I believe you have addressed the points I raised. Only the hash of the tokenized dataset remains.

NanoCode012 commented 1 day ago

I also notice that the loss masking may have some issues for chat_template (likely not due to this PR, but an older issue).

For example, it sometimes does not detect the assistant portion correctly, as in the dump below.

# user masked correctly
(-100, 235322) |(-100, 235371) im(-100, 571) _(-100, 235298) start(-100, 2997) |>(-100, 73786) user(-100, 1645) 
(-100, 108) Each(-100, 7639)  problem(-100, 3210)  consists(-100, 13569)  of(-100, 576)  three(-100, 2149)
...
# assistant also masked
 <(-100, 235322) |(-100, 235371) im(-100, 571) _(-100, 235298) start(-100, 2997) |>(-100, 73786) assistant(-100, 105776) 
(-100, 108) If(-100, 2495)  the(-100, 573)  third(-100, 4906)  statement(-100, 6218)  is(-100, 603)  true(-100, 1382)  ((-100, 591) Bananas(-100, 217061)  cost(-100, 2458) 
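
To inspect this locally, here is a small sketch (assuming a tokenized example with input_ids and labels, as produced by the chat_template strategy) that prints the same token(label, id) view and counts how many tokens actually contribute to the loss:

    def dump_labels(tokenizer, input_ids, labels):
        # Print each token as decoded_text(label, token_id), mirroring the dump above
        pieces = [
            f"{tokenizer.decode([tok])}({lab}, {tok})"
            for tok, lab in zip(input_ids, labels)
        ]
        print(" ".join(pieces))
        # Count tokens left unmasked (label != -100); if the assistant turn is being
        # masked as well, this number will be zero or suspiciously small
        unmasked = sum(lab != -100 for lab in labels)
        print(f"{unmasked}/{len(labels)} tokens contribute to the loss")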

Config to recreate:

base_model: mlx-community/gemma-2-2b

datasets:
  - path: LDJnr/Puffin
    type: chat_template
    chat_template: chatml
    shards: 10
chat_template: chatml

chiragjn commented 1 day ago

Sorry, I did not mean to block this PR. It is just that I have been busy at my main job. I wanted to address the DPO issues before merging; I'll make some time for this.