axolotl-ai-cloud / axolotl

PR #1756 breaks last turn tokenization for phi-3.5 #1916

Closed fozziethebeat closed 3 weeks ago

fozziethebeat commented 1 month ago

Please check that this issue hasn't been reported before.

Expected Behavior

When tokenizing a simple dataset (fozziethebeat/alpaca_messages_2k_test) with microsoft/Phi-3.5-mini-instruct, we should expect the last assistant turn and its end-of-turn token to all be included in the labels.

We'd expect something like

[2024-09-13 11:25:50,070] [INFO] [axolotl.check_example_labels:45] [PID:276898] [RANK:0] ..... Education(13151, 13151) is(338, 338) a(263, 263) powerful(13988, 13988) tool(5780, 5780) that(393, 393) has(756, 756) the(278, 278) ability(11509, 11509) to(304, 304) provide(3867, 3867) soci(5374, 5374) eties(20850, 20850) with(411, 411) long(1472, 1472) -(29899, 29899) term(8489, 8489) solutions(6851, 6851) to(304, 304) many(1784, 1784) of(310, 310) the(278, 278) world(3186, 3186) '(29915, 29915) s(29879, 29879) problems(4828, 4828) .(29889, 29889) Many(9267, 9267) of(310, 310) these(1438, 1438) problems(4828, 4828) can(508, 508) be(367, 367) trac(16703, 16703) ed(287, 287) back(1250, 1250) to(304, 304) a(263, 263) lack(10225, 10225) of(310, 310) education(9793, 9793) ,(29892, 29892) which(607, 607) highlight(12141, 12141) s(29879, 29879) the(278, 278) importance(13500, 13500) of(310, 310) providing(13138, 13138) people(2305, 2305) with(411, 411) a(263, 263) good(1781, 1781) foundation(22778, 22778) in(297, 297) education(9793, 9793) .(29889, 29889) By(2648, 2648) invest(13258, 13258) ing(292, 292) in(297, 297) education(9793, 9793) ,(29892, 29892) we(591, 591) can(508, 508) emp(3710, 3710) ower(1680, 1680) individuals(15724, 15724) and(322, 322) communities(23507, 23507) to(304, 304) tack(22002, 22002) le(280, 280) challeng(18066, 18066) es(267, 267) and(322, 322) create(1653, 1653) positive(6374, 6374) change(1735, 1735) .(29889, 29889) <|end|>(32007, 32007)

Note the <|end|>(32007, 32007) at the end: the end-of-turn token keeps its own id as the label.
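
For reference, here is a minimal sketch (using the plain transformers tokenizer, not axolotl itself; the example messages are shortened stand-ins) of how to locate Phi-3.5's end-of-turn marker in a conversation rendered with the model's own chat template:

# Sketch: find <|end|> (id 32007 per the log above) in a rendered conversation.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
messages = [
    {"role": "user", "content": "Fill in the blanks ..."},
    {"role": "assistant", "content": "Global warming can be reversed by ..."},
]
ids = tok.apply_chat_template(messages, tokenize=True)
end_id = tok.convert_tokens_to_ids("<|end|>")          # 32007
print([i for i, t in enumerate(ids) if t == end_id])   # positions of <|end|>
# The <|end|> closing the final assistant turn is the token whose label should
# be its own id rather than -100.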

Current behaviour

With the per-turn masking functionality introduced in #1756, the last end-of-turn token is masked out. I think this is because the new defaults for these per-turn masking features conflict with how Phi-3.5 configures its end-of-turn and end-of-sentence tokens (they are different tokens).

Currently we get

[2024-09-13 11:26:47,956] [INFO] [axolotl.check_example_labels:45] [PID:277347] [RANK:0] <s>(-100, 1) <|user|>(-100, 32010) F(-100, 383) ill(-100, 453) in(-100, 297) the(-100, 278) bl(-100, 1999) anks(-100, 1331) to(-100, 304) complete(-100, 4866) the(-100, 278) sentence(-100, 10541) .(-100, 29889) Global(-100, 12002) war(-100, 1370) ming(-100, 4056) can(-100, 508) be(-100, 367) revers(-100, 18764) ed(-100, 287) by(-100, 491) reducing(-100, 27668) _(-100, 903) ____(-100, 7652) ___(-100, 22359) and(-100, 322) _(-100, 903) ________(-100, 14365) _.(-100, 5396) <|end|>(-100, 32007) <|assistant|>(-100, 32001) Global(12002, 12002) war(1370, 1370) ming(4056, 4056) can(508, 508) be(367, 367) revers(18764, 18764) ed(287, 287) by(491, 491) reducing(27668, 27668) green(7933, 7933) house(8697, 8697) gas(10489, 10489) em(953, 953) issions(6847, 6847) and(322, 322) def(822, 822) or(272, 272) est(342, 342) ation(362, 362) .(29889, 29889) <|end|>(-100, 32007)

Note the <|end|>(-100, 32007) at the end: the end-of-turn token's label is masked to -100.
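
To see the mismatch described above, here is a minimal check with the plain transformers tokenizer (a sketch; the printed eos value depends on the tokenizer config shipped with the model):

# Sketch: compare the tokenizer's configured eos token with the chat template's
# end-of-turn marker. The new masking defaults appear to key off the former,
# while Phi-3.5 closes each turn with the latter (<|end|>, 32007 in the logs).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
print("eos_token :", tok.eos_token, tok.eos_token_id)
print("<|end|>   :", tok.convert_tokens_to_ids("<|end|>"))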

Steps to reproduce

  1. Use the provided config yaml (in examples/phi/lora-3.5.yaml)
  2. Run python -m axolotl.cli.preprocess examples/phi/lora-3.5.yaml --debug
  3. Inspect the output and note the incorrect label for the final <|end|> token

Config yaml

base_model: microsoft/Phi-3.5-mini-instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: true
load_in_4bit: false
strict: false

chat_template: phi_3
datasets:
  - path: fozziethebeat/alpaca_messages_2k_test
    type: chat_template
    chat_template: phi_3
    field_messages: messages
    message_field_role: role
    message_field_content: content
    roles:
      user:
        - user
      assistant:
        - assistant

dataset_prepared_path:
val_set_size: 0.05
output_dir: ./outputs/lora-out

sequence_len: 4096
sample_packing: false
pad_to_sequence_len: true

adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 4
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bfloat16: true
bf16: true
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
s2_attention:

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 4
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
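
Beyond the --debug printout, the prepared labels can also be inspected directly. A hedged sketch, assuming dataset_prepared_path above is set to a real directory and that the prepared split is saved in Hugging Face datasets on-disk format with input_ids/labels columns (the exact subdirectory layout may differ):

# Sketch: dump (token id, label) pairs from a prepared example.
from datasets import load_from_disk

ds = load_from_disk("./prepared_data")   # hypothetical dataset_prepared_path
row = ds[0]
for token_id, label in zip(row["input_ids"], row["labels"]):
    # A label of -100 means the position is masked out of the loss.
    print(token_id, label)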

Possible solution

This PR.

Which Operating Systems are you using?

Python Version

3.10

axolotl branch-commit

main/3853ab7ae9220dfbd78cd628e54fde75fb89df97

Acknowledgements

hammoudhasan commented 1 month ago

I'm having a similar issue with mistralai/Mistral-7B-Instruct-v0.3: tokenization goes wrong, and most samples are dropped in the Drop Samples with Zero Trainable Tokens step (even after pulling the latest repo with the PR merged). No clue why it happens with this Mistral model in particular.
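
A small diagnostic sketch (not axolotl's code) for the drop: a sample lands in the Drop Samples with Zero Trainable Tokens bucket when every label is -100, so counting the non-masked positions in a prepared example shows whether the whole conversation got masked:

def trainable_token_count(labels):
    """Number of label positions that actually contribute to the loss."""
    return sum(1 for label in labels if label != -100)

# Applied to a row of the prepared dataset (column names assumed from the
# debug output above):
# print(trainable_token_count(row["labels"]))  # 0 means the sample is dropped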

fozziethebeat commented 1 month ago

I added a fix for this problem in this PR. Can you try adding another unit test similar to the phi-3.5 test I added and see whether the same behavior shows up?
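
A rough sketch of the kind of check such a test could make for Mistral (not the repo's actual test harness, and the messages are placeholders): render a short chat with the model's own template and look at the token that closes the assistant turn, which is the one that should stay trainable.

# Sketch: inspect which token ends the assistant turn under Mistral's template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
messages = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "hi there"},
]
ids = tok.apply_chat_template(messages, tokenize=True)
print(tok.convert_ids_to_tokens(ids))  # the final token is the turn-closing eos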

ehartford commented 1 month ago

Thank you for this

NanoCode012 commented 3 weeks ago

Hey @fozziethebeat, since the PR has been merged, should this issue be closed?

fozziethebeat commented 3 weeks ago

Yes~