axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

Data Gets Tokenized Before Special Tokens Are Added #1770

Closed: hammoudhasan closed this issue 1 month ago

hammoudhasan commented 1 month ago

Expected Behavior

When special tokens or added tokens are defined in the config, they should be added to the tokenizer configuration before the preprocessing tokenization step runs.

Current behaviour

Currently, data is tokenized without the specified new special tokens (i.e., they are replaced by spaces where the tokens defined in the chat template should appear).

Python Version

3.10

axolotl branch-commit

main

winglian commented 1 month ago

@hammoudhasan did you try adding the following to your config?

tokens:
  - "<|im_start|>"
  - "<|im_end|>"

hammoudhasan commented 1 month ago

I don't think I had those in my config. Let me test and get back to you on that (: Thank you @winglian

hammoudhasan commented 1 month ago

As you mentioned, those tokens were missing on my end! Adding them worked.
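
For anyone who finds this thread later, here is a minimal sketch of why the tokens have to be registered before the data is tokenized. It is a toy, standard-library tokenizer written only for illustration. It is not axolotl's or Hugging Face's implementation, and the `tokenize` function and its `added_tokens` parameter are hypothetical names. A real tokenizer may treat unregistered markers differently (for example, the original report describes them as replaced by spaces), but the ordering issue is the same.

```python
import re


def tokenize(text, added_tokens=()):
    """Toy tokenizer (illustration only, not a real library API).

    Registered tokens stay atomic; all other text is split into
    word runs and single punctuation characters.
    """
    if added_tokens:
        # Try longer tokens first so overlapping tokens match greedily.
        alternatives = sorted(added_tokens, key=len, reverse=True)
        pattern = "(" + "|".join(re.escape(t) for t in alternatives) + ")"
        # The capture group makes re.split keep the matched tokens.
        chunks = re.split(pattern, text)
    else:
        chunks = [text]

    known = set(added_tokens)
    pieces = []
    for chunk in chunks:
        if chunk in known:
            pieces.append(chunk)  # registered token: keep it whole
        else:
            pieces.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return pieces


sample = "<|im_start|>user\nhello<|im_end|>"

# Tokens not registered yet: the chat markers are broken into fragments.
before = tokenize(sample)

# Tokens registered first, as with the `tokens:` list in the config.
after = tokenize(sample, added_tokens=["<|im_start|>", "<|im_end|>"])

print(before)
print(after)
```

In this sketch, `before` contains only fragments of the chat markers, while `after` keeps `<|im_start|>` and `<|im_end|>` as single tokens.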