(py package) Train your own BPE tokenizer for LLMs (supports regex patterns and special tokens)
`special_tokens` in the encode method doesn't work for the BPETokenizer #12
Closed
Hk669 closed 3 weeks ago
Describe the bug
The `special_tokens` argument doesn't work on the current tokenizer, which splits the special tokens too and encodes them separately.
Expected behavior
The encoder should append the ids of the special tokens present in the vocab. Instead, it splits the special tokens into subtokens and then performs normal BPE encoding on them.
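For reference, here is a minimal sketch of the expected handling, independent of the package's internals. The names `encode_ordinary`, the example vocab, and the usage at the bottom are hypothetical stand-ins, not the BPETokenizer's actual API: the idea is to split the input on the special tokens first, map each one directly to its reserved id, and run BPE only on the ordinary segments in between.

```python
import re

def encode_with_special_tokens(text, special_tokens, encode_ordinary):
    """Encode `text`, mapping each special token to its reserved id.

    special_tokens: dict of token string -> token id, e.g. {"<|endoftext|>": 256}
    encode_ordinary: fallback that BPE-encodes plain text into a list of ids
    """
    if not special_tokens:
        return encode_ordinary(text)
    # Build an alternation like (<\|endoftext\|>|<\|pad\|>); the capturing
    # group makes re.split keep the special tokens as their own chunks.
    pattern = "(" + "|".join(re.escape(tok) for tok in special_tokens) + ")"
    ids = []
    for chunk in re.split(pattern, text):
        if chunk in special_tokens:
            # Special tokens bypass BPE and map straight to their reserved id.
            ids.append(special_tokens[chunk])
        elif chunk:
            # Everything else goes through normal BPE encoding.
            ids.extend(encode_ordinary(chunk))
    return ids

# Hypothetical usage: "<|endoftext|>" comes back as a single id, not as
# the ids of its characters or subwords.
specials = {"<|endoftext|>": 256}
fake_bpe = lambda s: [ord(c) for c in s]  # stand-in for the real BPE encoder
print(encode_with_special_tokens("hi<|endoftext|>yo", specials, fake_bpe))
# -> [104, 105, 256, 121, 111]
```

The reported bug corresponds to skipping the split step above, so the special token string falls through to `encode_ordinary` and gets broken into subtokens.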