Hk669 / bpetokenizer

(Python package) Train your own BPE tokenizer for LLMs (supports regex patterns and special tokens)
https://pypi.org/project/bpetokenizer/

`special_tokens` in the encode method doesn't work for the BPETokenizer #12

Closed Hk669 closed 3 weeks ago

Hk669 commented 3 weeks ago

Describe the bug

The `special_tokens` argument doesn't work in the current tokenizer's `encode` method: it splits the special tokens as well and encodes them separately.

Expected behavior

The encoder should append the ids of the special tokens present in the vocab. Instead, it splits the special tokens into subtokens and then performs normal BPE encoding on them.
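A minimal sketch of the expected behavior, assuming a `special_tokens` dict mapping token strings to ids and a plain BPE encoder `encode_ordinary` (both names are illustrative, not the package's actual API): the text is split on the special tokens, the special tokens map directly to their ids, and only the ordinary chunks in between go through BPE.

```python
import re

def encode_with_special_tokens(text, special_tokens, encode_ordinary):
    """Sketch of the expected behavior (assumed names, not the
    library's API): special tokens map straight to their ids;
    only the text between them is BPE-encoded."""
    # Capturing group so re.split keeps the special tokens in the output.
    # Assumes special_tokens is non-empty.
    special_pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    ids = []
    for chunk in re.split(special_pattern, text):
        if chunk in special_tokens:
            # Append the special token's id directly -- never split it.
            ids.append(special_tokens[chunk])
        elif chunk:
            # Ordinary text goes through the normal BPE encode path.
            ids.extend(encode_ordinary(chunk))
    return ids
```

This split-then-map approach is common in BPE implementations; the fix for this issue would apply the same idea inside `BPETokenizer.encode` so special tokens are never broken into subtokens.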

Screenshots

(screenshot omitted)