Hk669 / bpetokenizer

(Python package) Train your own BPE tokenizer for LLMs (supports regex patterns and special tokens)
https://pypi.org/project/bpetokenizer/

`special_tokens` in the encode method doesn't work for the BPETokenizer #12

Closed Hk669 closed 3 weeks ago

Hk669 commented 3 weeks ago

Describe the bug

The `special_tokens` argument doesn't work in the current tokenizer's `encode` method: it splits the special tokens as well and encodes them separately.

Expected behavior

The encoder should append the ids of the special tokens present in the vocab. Instead, it splits the special tokens into subtokens and then performs normal BPE encoding on them.
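A minimal sketch of the expected behavior, assuming a `special_tokens` dict mapping token strings to ids and a plain BPE encoder `encode_ordinary` (both names are illustrative, not the package's actual API): the text is split on the special tokens, the special tokens map directly to their ids, and only the ordinary chunks in between go through BPE.

```python
import re

def encode_with_special_tokens(text, special_tokens, encode_ordinary):
    """Sketch of the expected behavior (assumed names, not the
    library's API): special tokens map straight to their ids;
    only the text between them is BPE-encoded."""
    # Capturing group so re.split keeps the special tokens in the output.
    # Assumes special_tokens is non-empty.
    special_pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    ids = []
    for chunk in re.split(special_pattern, text):
        if chunk in special_tokens:
            # Append the special token's id directly -- never split it.
            ids.append(special_tokens[chunk])
        elif chunk:
            # Ordinary text goes through the normal BPE encode path.
            ids.extend(encode_ordinary(chunk))
    return ids
```

This split-then-map approach is common in BPE implementations; the fix for this issue would apply the same idea inside `BPETokenizer.encode` so special tokens are never broken into subtokens.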

Screenshots

(screenshot omitted)