Hello!
I work with a custom HF tokenizer that has some tokens mapped to different Unicode characters (e.g., space expressed as a thick underscore).

I see that those mappings are handled in `get_original_characters`, where `new_vocab` is built based on `tokenizer.get_vocab()`. Currently, though, the vocabulary processors are autodetected (via the `_autodetect_processors` function), and there is no way for the user to specify their own vocabulary processors.

What do you think about introducing such flexibility? I made a quick draft to show how I envision it: https://github.com/Dan-wanna-M/formatron/pull/21.
All the best!