QwenLM / Qwen2.5-Coder

Qwen2.5-Coder is the code version of Qwen2.5, the large language model series developed by the Qwen team at Alibaba Cloud.

Possible issue with tokenizer_config.json in HF model upload of CodeQwen 1.5 7B #77

Closed: artemdinaburg closed this issue 6 months ago

artemdinaburg commented 6 months ago

Hi!

I tried to reach someone via the Discord but haven't been able to, so I figured I'd post here.

The CodeQwen 1.5 7B model on HF defines the model as Qwen2ForCausalLM (https://huggingface.co/Qwen/CodeQwen1.5-7B/blob/main/config.json) but the tokenizer as PreTrainedTokenizerFast (https://huggingface.co/Qwen/CodeQwen1.5-7B/blob/main/tokenizer_config.json#L14).

This is probably wrong: PreTrainedTokenizerFast is the generic fast-tokenizer base class, and, more importantly, it returns token_type_ids by default (https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L1562), while Qwen2ForCausalLM does not accept that field from the tokenizer output.
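To illustrate, here is a minimal sketch of my own (assuming transformers >= 4.40; the signature check only needs the library, not the model weights):

import inspect
from transformers import AutoTokenizer, Qwen2ForCausalLM

tok = AutoTokenizer.from_pretrained("Qwen/CodeQwen1.5-7B")
enc = tok("def hello():")
print(type(tok).__name__)  # PreTrainedTokenizerFast with the current config
print(list(enc.keys()))    # includes 'token_type_ids' with the affected config

# Qwen2ForCausalLM.forward() has no token_type_ids parameter, so passing the
# encoding through unchanged raises a TypeError about an unexpected keyword argument.
print("token_type_ids" in inspect.signature(Qwen2ForCausalLM.forward).parameters)  # False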

This causes both tokenization and finetuning failures when using Axolotl to finetune CodeQwen 1.5 7B in completion mode.

The correct choice is almost certainly to set the tokenizer class to Qwen2Tokenizer, as in the regular (non-code) Qwen1.5 models (see: https://huggingface.co/Qwen/Qwen1.5-4B/blob/main/tokenizer_config.json#L38).

I tried manually editing tokenizer_config.json in a local CodeQwen 1.5 download and confirmed that it fixes my finetuning issues.
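For reference, the local edit was roughly the following (the path is just where I unpacked the snapshot; adjust as needed):

import json

cfg_path = "CodeQwen1.5-7B/tokenizer_config.json"  # local model download
with open(cfg_path) as f:
    cfg = json.load(f)

# Swap the generic fast-tokenizer class for the Qwen2 one, mirroring the
# regular Qwen1.5 repos.
cfg["tokenizer_class"] = "Qwen2Tokenizer"  # was "PreTrainedTokenizerFast"

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2, ensure_ascii=False)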

artemdinaburg commented 6 months ago

As another reference, this problem looks to be an exact replica of https://github.com/UKPLab/sentence-transformers/issues/2588 which may explain things better.

cyente commented 6 months ago

CodeQwen now uses a different tokenizer than Qwen1.5, so using Qwen2Tokenizer may not be correct.

Could you provide a more detailed reproduction, including environment versions, demo scripts, etc.?

(FYI, we updated the config 17 days ago to accommodate the PreTrainedTokenizerFast changes in transformers==4.40.0.)

artemdinaburg commented 6 months ago

Hi! Thank you so much for getting back to me! I can replicate the issue in Axolotl, and as a sanity check I wanted to see if I could also replicate it in LLamaFactory. I could not, but interestingly I saw this:

[INFO|trainer.py:804] 2024-05-30 20:39:25,035 >> The following columns in the training set don't have a corresponding argument in `PeftModelForCausalLM.forward` and have been ignored: token_type_ids. If token_type_ids are not expected by `PeftModelForCausalLM.forward`, you can safely ignore this message.

It looks like the base transformers.Trainer implementation automatically removes unused columns from the Dataset in get_train_dataloader, which avoids this error when running via LLamaFactory. Axolotl defines a custom get_train_dataloader that does not drop unused columns, so the following failure happens (a sketch of the column-dropping workaround follows the traceback below):

python3 -m axolotl.cli.train codeqwen.yml
[2024-05-30 22:40:32,983] [INFO] [datasets.<module>:58] [PID:41130] PyTorch version 2.3.0 available.
[2024-05-30 22:40:34,099] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-30 22:40:34,171] [INFO] [root.spawn:38] [PID:41130] gcc -pthread -B /opt/conda/envs/axo/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/conda/envs/axo/include -fPIC -O2 -isystem /opt/conda/envs/axo/include -fPIC -c /var/tmp/tmpmhbgpa43/test.c -o /var/tmp/tmpmhbgpa43/test.o
[2024-05-30 22:40:34,192] [INFO] [root.spawn:38] [PID:41130] gcc -pthread -B /opt/conda/envs/axo/compiler_compat /var/tmp/tmpmhbgpa43/test.o -laio -o /var/tmp/tmpmhbgpa43/a.out
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.0), only 1.0.0 is known to be compatible
[2024-05-30 22:40:35,431] [WARNING] [axolotl.utils.config.models.input.hint_sample_packing_padding:747] [PID:41130] [RANK:0] `pad_to_sequence_len: true` is recommended when using sample_packing
[2024-05-30 22:40:35,431] [WARNING] [axolotl.utils.config.models.input.check_sample_packing_wo_flash:730] [PID:41130] [RANK:0] sample_packing without flash_attention or sdp_attention does not handle cross-attention.
[2024-05-30 22:40:35,431] [WARNING] [axolotl.utils.config.models.input.hint_trust_remote_code:308] [PID:41130] [RANK:0] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
[2024-05-30 22:40:35,432] [DEBUG] [axolotl.normalize_config:79] [PID:41130] [RANK:0] bf16 support detected, enabling for this configuration.
/opt/conda/envs/axo/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[2024-05-30 22:40:35,562] [INFO] [axolotl.normalize_config:182] [PID:41130] [RANK:0] GPU memory usage baseline: 0.000GB (+0.537GB misc)
                                 dP            dP   dP
                                 88            88   88
      .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88
      88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88
      88.  .88  .d88b.  88.  .88 88 88.  .88   88   88
      `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP

****************************************
**** Axolotl Dependency Versions *****
  accelerate: 0.30.1
        peft: 0.11.1
transformers: 4.41.1
         trl: 0.8.6
       torch: 2.3.0
bitsandbytes: 0.43.1
****************************************
/opt/conda/envs/axo/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[2024-05-30 22:40:36,893] [DEBUG] [axolotl.load_tokenizer:280] [PID:41130] [RANK:0] EOS: 2 / <|endoftext|>
[2024-05-30 22:40:36,893] [DEBUG] [axolotl.load_tokenizer:281] [PID:41130] [RANK:0] BOS: 2 / <|endoftext|>
[2024-05-30 22:40:36,893] [DEBUG] [axolotl.load_tokenizer:282] [PID:41130] [RANK:0] PAD: 92298 / <fim_pad>
[2024-05-30 22:40:36,893] [DEBUG] [axolotl.load_tokenizer:283] [PID:41130] [RANK:0] UNK: 0 / <unk>
[2024-05-30 22:40:36,893] [INFO] [axolotl.load_tokenizer:294] [PID:41130] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-05-30 22:40:36,893] [INFO] [axolotl.load_tokenized_prepared_datasets:183] [PID:41130] [RANK:0] Unable to find prepared dataset in cq/last/fbee903f846296c4151e7746385e182d
[2024-05-30 22:40:36,893] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:41130] [RANK:0] Loading raw datasets...
[2024-05-30 22:40:36,894] [WARNING] [axolotl.load_tokenized_prepared_datasets:186] [PID:41130] [RANK:0] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset.
[2024-05-30 22:40:36,894] [INFO] [axolotl.load_tokenized_prepared_datasets:193] [PID:41130] [RANK:0] No seed provided, using default seed of 42
Generating train split: 3 examples [00:00, 659.79 examples/s]
num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.
[2024-05-30 22:40:37,140] [WARNING] [datasets.arrow_dataset.map:3087] [PID:41130] num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.
Tokenizing Prompts (num_proc=3): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  6.21 examples/s]
[2024-05-30 22:40:37,739] [INFO] [axolotl.load_tokenized_prepared_datasets:414] [PID:41130] [RANK:0] merging datasets
num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.
[2024-05-30 22:40:37,743] [WARNING] [datasets.arrow_dataset.map:3087] [PID:41130] num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.
Dropping Long Sequences (num_proc=3): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 25.83 examples/s]
num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.
[2024-05-30 22:40:37,909] [WARNING] [datasets.arrow_dataset.map:3087] [PID:41130] num_proc must be <= 3. Reducing num_proc to 3 for dataset of size 3.
Add position_id column (Sample Packing) (num_proc=3): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 25.03 examples/s]
[2024-05-30 22:40:38,080] [INFO] [axolotl.load_tokenized_prepared_datasets:427] [PID:41130] [RANK:0] Saving merged prepared dataset to disk... cq/last/fbee903f846296c4151e7746385e182d
Saving the dataset (1/1 shards): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 515.38 examples/s]
[2024-05-30 22:40:38,095] [DEBUG] [axolotl.calculate_total_num_steps:299] [PID:41130] [RANK:0] total_num_tokens: 394
[2024-05-30 22:40:38,096] [DEBUG] [axolotl.calculate_total_num_steps:312] [PID:41130] [RANK:0] `total_supervised_tokens: 394`
[2024-05-30 22:40:43,669] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:41130] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 394
[2024-05-30 22:40:43,669] [DEBUG] [axolotl.calculate_total_num_steps:364] [PID:41130] [RANK:0] data_loader_len: 0
[2024-05-30 22:40:43,669] [INFO] [axolotl.calc_sample_packing_eff_est:370] [PID:41130] [RANK:0] sample_packing_eff_est across ranks: [0.2565104166666667]
[2024-05-30 22:40:43,669] [DEBUG] [axolotl.calculate_total_num_steps:382] [PID:41130] [RANK:0] sample_packing_eff_est: 0.26
[2024-05-30 22:40:43,669] [DEBUG] [axolotl.calculate_total_num_steps:390] [PID:41130] [RANK:0] total_num_steps: 0
[2024-05-30 22:40:43,684] [DEBUG] [axolotl.train.train:56] [PID:41130] [RANK:0] loading tokenizer... Qwen/CodeQwen1.5-7B
/opt/conda/envs/axo/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[2024-05-30 22:40:44,742] [DEBUG] [axolotl.load_tokenizer:280] [PID:41130] [RANK:0] EOS: 2 / <|endoftext|>
[2024-05-30 22:40:44,742] [DEBUG] [axolotl.load_tokenizer:281] [PID:41130] [RANK:0] BOS: 2 / <|endoftext|>
[2024-05-30 22:40:44,742] [DEBUG] [axolotl.load_tokenizer:282] [PID:41130] [RANK:0] PAD: 92298 / <fim_pad>
[2024-05-30 22:40:44,742] [DEBUG] [axolotl.load_tokenizer:283] [PID:41130] [RANK:0] UNK: 0 / <unk>
[2024-05-30 22:40:44,742] [INFO] [axolotl.load_tokenizer:294] [PID:41130] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-05-30 22:40:44,742] [DEBUG] [axolotl.train.train:85] [PID:41130] [RANK:0] loading model and peft_config...
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00,  1.45s/it]
/opt/conda/envs/axo/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:520: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.95` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/opt/conda/envs/axo/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:520: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.95` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
[2024-05-30 22:40:52,523] [INFO] [axolotl.load_model:734] [PID:41130] [RANK:0] GPU memory usage after model load: 3.350GB (+0.066GB cache, +0.751GB misc)
[2024-05-30 22:40:52,546] [INFO] [axolotl.load_model:785] [PID:41130] [RANK:0] converting PEFT model w/ prepare_model_for_kbit_training
[2024-05-30 22:40:52,552] [INFO] [axolotl.load_model:794] [PID:41130] [RANK:0] converting modules to torch.bfloat16 for flash attention
[2024-05-30 22:40:52,556] [INFO] [axolotl.load_lora:951] [PID:41130] [RANK:0] found linear modules: ['q_proj', 'o_proj', 'k_proj', 'down_proj', 'v_proj', 'up_proj', 'gate_proj']
trainable params: 320,339,968 || all params: 7,570,624,512 || trainable%: 4.2314
[2024-05-30 22:40:56,245] [INFO] [axolotl.load_model:843] [PID:41130] [RANK:0] GPU memory usage after adapters: 3.797GB (+1.035GB cache, +0.751GB misc)
/opt/conda/envs/axo/lib/python3.10/site-packages/transformers/training_args.py:1474: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
[2024-05-30 22:40:56,277] [INFO] [axolotl.train.train:119] [PID:41130] [RANK:0] Pre-saving adapter config to ./cq
[2024-05-30 22:40:56,353] [INFO] [axolotl.train.train:156] [PID:41130] [RANK:0] Starting trainer...
[2024-05-30 22:40:56,600] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:41130] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 394
[2024-05-30 22:40:56,601] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:41130] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 394
  0%|                                                                                                                                                                                                             | 0/1 [00:00<?, ?it/s][2024-05-30 22:40:56,658] [INFO] [axolotl.utils.samplers.multipack._len_est:185] [PID:41130] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 394
You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
  File "/opt/conda/envs/axo/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/axo/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/artem/axolotl/src/axolotl/cli/train.py", line 70, in <module>
    fire.Fire(do_cli)
  File "/opt/conda/envs/axo/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/envs/axo/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/envs/axo/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/artem/axolotl/src/axolotl/cli/train.py", line 38, in do_cli
    return do_train(parsed_cfg, parsed_cli_args)
  File "/home/artem/axolotl/src/axolotl/cli/train.py", line 66, in do_train
    return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
  File "/home/artem/axolotl/src/axolotl/train.py", line 170, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/opt/conda/envs/axo/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/opt/conda/envs/axo/lib/python3.10/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/envs/axo/lib/python3.10/site-packages/transformers/trainer.py", line 3238, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/artem/axolotl/src/axolotl/core/trainer_builder.py", line 526, in compute_loss
    return super().compute_loss(model, inputs, return_outputs=return_outputs)
  File "/opt/conda/envs/axo/lib/python3.10/site-packages/transformers/trainer.py", line 3264, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/envs/axo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/axo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/axo/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
  File "/opt/conda/envs/axo/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/opt/conda/envs/axo/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/opt/conda/envs/axo/lib/python3.10/site-packages/peft/peft_model.py", line 1430, in forward
    return self.base_model(
  File "/opt/conda/envs/axo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/axo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/axo/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 179, in forward
    return self.model.forward(*args, **kwargs)
  File "/opt/conda/envs/axo/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
TypeError: Qwen2ForCausalLM.forward() got an unexpected keyword argument 'token_type_ids'
  0%|                                                                                                                                                                                                             | 0/1 [00:00<?, ?it/s]
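
As mentioned above, transformers.Trainer sidesteps this by dropping columns that the model's forward() does not accept (its remove_unused_columns behaviour). A toy sketch of that column-dropping workaround on a Hugging Face Dataset (the dataset contents here are made up for illustration):

from datasets import Dataset

# Toy tokenized dataset; the real one comes from the tokenization step in the log above.
ds = Dataset.from_dict({
    "input_ids": [[1, 2, 3]],
    "attention_mask": [[1, 1, 1]],
    "token_type_ids": [[0, 0, 0]],  # not accepted by Qwen2ForCausalLM.forward()
})

# Drop the offending column before building the dataloader; this is what the
# stock Trainer effectively does and what a custom dataloader would need to replicate.
ds = ds.remove_columns("token_type_ids")
print(ds.column_names)  # ['input_ids', 'attention_mask']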

I'm attaching an Axolotl config and data file that trigger the issue.

Anyway, I am not quite sure what should be patched. In theory, the tokenizer should agree with the model on which data columns to expect, but maybe the trainer should also handle the case where they don't 🤷.
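One possible stopgap on the tokenizer side is sketched below (model_input_names is a standard attribute on transformers tokenizers, but I haven't confirmed this is the intended fix):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/CodeQwen1.5-7B")
# Tell the tokenizer which inputs the model actually expects so that it stops
# emitting token_type_ids.
tok.model_input_names = ["input_ids", "attention_mask"]

enc = tok("print('hello')")
print(list(enc.keys()))  # no 'token_type_ids'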

If there's some way to fix the model so that the data generated by the tokenizer matches what's expected by the model, I'd be very thankful. axo_codeqwen.tar.gz

cyente commented 6 months ago

Thank you for reporting this bug.

I see you have already opened a pull request to fix this bug in Axolotl:

https://github.com/OpenAccess-AI-Collective/axolotl/pull/1656