Hello!
I work with a custom HF tokenizer that has some tokens mapped to different Unicode characters (e.g., space expressed as a thick underscore).

I see that those mappings are handled in `get_original_characters`, where `new_vocab` is built based on `tokenizer.get_vocab()`. Currently, though, the vocabulary processors are autodetected (via the `_autodetect_processors` function), and there is no way for the user to specify their own vocabulary processors.

What do you think about introducing such flexibility? I made a quick draft to show how I envision it: https://github.com/Dan-wanna-M/formatron/pull/21.
All the best!