HF tokenizer will replace certain unicode characters with a space ' '. Therefore, the token-level prediction will become shorter than the input, which can cause mis-matched sequences. This PR tries to fix this issue via enabling replacing such unicode characters with the [UNK] tokens. We replace unicode characters in certain "categories", namely, ["Cc", "Cf", "Co", "Cs", "Mn", "Zl", "Zp", "Zs"], as specified by the rules in the corresponding HF tokenizer:
A future update could be just replacing the unicode characters in the cached file examples/find-empty-unicode-chars/zero-length-unicode-chars.txt, which we've tested and confirmed that has zero tokenization lengths.
HF tokenizer will replace certain unicode characters with a space
' '
. Therefore, the token-level prediction will become shorter than the input, which can cause mis-matched sequences. This PR tries to fix this issue via enabling replacing such unicode characters with the[UNK]
tokens. We replace unicode characters in certain "categories", namely,["Cc", "Cf", "Co", "Cs", "Mn", "Zl", "Zp", "Zs"]
, as specified by the rules in the corresponding HF tokenizer:Usage:
A future update could be just replacing the unicode characters in the cached file
examples/find-empty-unicode-chars/zero-length-unicode-chars.txt
, which we've tested and confirmed that has zero tokenization lengths.