Dan-wanna-M / formatron

Formatron empowers everyone to control the format of language models' output with minimal overhead.
MIT License

Custom HuggingFace tokenizers #22

Closed lukaszkolodziejczyk closed 3 weeks ago

lukaszkolodziejczyk commented 1 month ago

Hello!

I work with a custom HF tokenizer that has some tokens mapped to different Unicode characters (e.g., a space expressed as a thick underscore, "▁").

I see that those mappings are handled in get_original_characters, where new_vocab is built from tokenizer.get_vocab(). Currently, though, the vocabulary processors are autodetected (via the _autodetect_processors function), and there is no way for the user to specify their own vocabulary processors.
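
To make the idea concrete, here is a rough sketch of the kind of processor I have in mind. The names here (metaspace_processor, apply_processors) are illustrative only, not formatron's actual API; the real hook I'm proposing is in the draft PR linked below:

```python
# Illustrative sketch: a "vocabulary processor" here is just a callable that
# rewrites a raw token string into the characters it actually decodes to.

def metaspace_processor(token: str) -> str:
    # My tokenizer encodes a space as U+2581 (the "thick underscore").
    return token.replace("\u2581", " ")

def apply_processors(vocab: dict[str, int], processors) -> dict[str, int]:
    # Build the "original characters" vocabulary by running every token
    # through user-supplied processors instead of autodetected ones.
    new_vocab = {}
    for token, token_id in vocab.items():
        for processor in processors:
            token = processor(token)
        new_vocab[token] = token_id
    return new_vocab

# Usage sketch:
# vocab = tokenizer.get_vocab()
# new_vocab = apply_processors(vocab, [metaspace_processor])
```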

What do you think about introducing such flexibility? I made a quick draft to show how I envision it: https://github.com/Dan-wanna-M/formatron/pull/21.

All the best!