axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

Warn if we override the chat template in the tokenizer config #1822

Closed. chiwanpark closed this PR 2 months ago.

chiwanpark commented 2 months ago

Description

This PR adds a warning message if we override the chat template in the tokenizer config with the one in the training config.
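The change described above can be sketched roughly as follows. This is an illustrative sketch only, not axolotl's actual internals: the function name `apply_chat_template_override`, the `cfg_chat_template` parameter, and the logger setup are all assumptions.

```python
import logging

LOG = logging.getLogger("axolotl")


def apply_chat_template_override(tokenizer, cfg_chat_template):
    """Override the tokenizer's chat template with the one from the
    training config, warning that the tokenizer's own template is
    being replaced (hypothetical sketch of this PR's behavior)."""
    if cfg_chat_template is None:
        return
    if getattr(tokenizer, "chat_template", None) is not None:
        # The tokenizer config already ships a chat template; the
        # training config is about to replace it, so warn the user.
        LOG.warning(
            "The chat template is being overridden by the one in the "
            "config. This may cause a mismatch with the existing chat "
            "template in the base model."
        )
    tokenizer.chat_template = cfg_chat_template
```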

Motivation and Context

This PR resolves #1120.

How has this been tested?

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

NanoCode012 commented 2 months ago

Hey, thanks for the PR. Can you also add a check that the new chat template string is actually different? We don't want to warn if it's the same.

Could you also share some test logs showing when the warning does and does not appear?

chiwanpark commented 2 months ago

@NanoCode012 Sorry for the late response. I've updated the patch to warn only if the overriding template differs from the one in the tokenizer config. Here is an example log from applying the chatml template to the gemma-2-2b-it model:

                                 dP            dP   dP 
                                 88            88   88 
      .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88 
      88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88 
      88.  .88  .d88b.  88.  .88 88 88.  .88   88   88 
      `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP 

****************************************
**** Axolotl Dependency Versions *****
  accelerate: 0.32.0         
        peft: 0.12.0         
transformers: 4.43.4         
         trl: 0.9.6          
       torch: 2.4.0+cu124    
bitsandbytes: 0.43.3         
****************************************
[2024-08-24 15:28:54,225] [INFO] [axolotl.utils.config.models.input.check_eval_packing:964] [PID:21491] [RANK:0] explicitly setting `eval_sample_packing` to match `sample_packing`
[2024-08-24 15:28:54,227] [DEBUG] [axolotl.normalize_config:83] [PID:21491] [RANK:0] bf16 support not detected, disabling for this configuration.
[2024-08-24 15:28:54,236] [INFO] [axolotl.normalize_config:183] [PID:21491] [RANK:0] GPU memory usage baseline: 0.000GB ()
[2024-08-24 15:28:56,078] [DEBUG] [axolotl.load_tokenizer:282] [PID:21491] [RANK:0] EOS: 107 / <end_of_turn>
[2024-08-24 15:28:56,078] [DEBUG] [axolotl.load_tokenizer:283] [PID:21491] [RANK:0] BOS: 2 / <bos>
[2024-08-24 15:28:56,078] [DEBUG] [axolotl.load_tokenizer:284] [PID:21491] [RANK:0] PAD: 0 / <pad>
[2024-08-24 15:28:56,078] [DEBUG] [axolotl.load_tokenizer:285] [PID:21491] [RANK:0] UNK: 3 / <unk>
[2024-08-24 15:28:56,079] [WARNING] [axolotl.load_tokenizer:295] [PID:21491] [RANK:0] The chat template is being overridden by the one in the config. This may cause a mismatch with the existing chat template in the base model. Please verify that this override is intended and compatible.
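The updated patch's difference check can be sketched like this. Again, the function and variable names are illustrative assumptions, not axolotl's real code:

```python
import logging

LOG = logging.getLogger("axolotl")


def maybe_override_chat_template(tokenizer, cfg_chat_template):
    """Warn only when the config's template actually differs from the
    tokenizer's existing one (hypothetical sketch of the updated patch)."""
    existing = getattr(tokenizer, "chat_template", None)
    if cfg_chat_template is None or cfg_chat_template == existing:
        # Nothing to change, or the templates are identical: stay silent.
        return
    if existing is not None:
        # Templates differ, so the override is meaningful: warn.
        LOG.warning(
            "The chat template is being overridden by the one in the "
            "config. This may cause a mismatch with the existing chat "
            "template in the base model. Please verify that this "
            "override is intended and compatible."
        )
    tokenizer.chat_template = cfg_chat_template
```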
NanoCode012 commented 2 months ago

Hello, we're currently trying to get PR #1732 in, which should make this more explicit. For example, setting chat_template: tokenizer_default would use the tokenizer's template. That way, if a user explicitly sets a different template such as chatml, we can assume they understand what they're doing.

Would this be reasonable? If so, we can close this PR.
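For reference, the option described above would sit in the axolotl training config YAML roughly like this; only the chat_template key and the tokenizer_default / chatml values come from this thread, so treat the fragment as a sketch:

```yaml
# Keep the chat template shipped in the tokenizer config
# (the explicit opt-in proposed in PR #1732):
chat_template: tokenizer_default

# Alternatively, explicitly choosing a different template signals
# that the override is intentional:
# chat_template: chatml
```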

chiwanpark commented 2 months ago

@NanoCode012 Yep, the chat_template: tokenizer_default approach would be better than this warning. I'll close this PR. :)