formermagic / formerbox


Tokenizer Mirrors & Trainers #22

Closed · mozharovsky closed this 4 years ago

mozharovsky commented 4 years ago

Summary

With this PR we slightly refactor the tokenization pipeline in the library.

First, we provide extended 🤗/transformers tokenizers that support additional features and backend tokenization properties (e.g. customizable tokenization pipeline steps). Second, we scale back the tokenization modules, which were meant to act as proxy objects for tokenizers with a built-in training procedure. This didn't work out, since it created a distinction between tokenizers and tokenizer modules that was pure overhead. We decided to reduce tokenizer modules to tokenizer trainers with a single purpose: training a new tokenizer. Tokenizer trainers simply connect the backend tokenization utilities (provided by 🤗/tokenizers) with 🤗/transformers tokenizers. That's it, no more feature proxying.
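
As a rough sketch of how such a trainer could wire the two libraries together (the class and method names below, such as `ByteLevelBPETokenizerTrainer` and `save_and_convert`, are illustrative assumptions, not the library's actual API):

```python
# Illustrative sketch only: the trainer class and its methods are hypothetical,
# not formerbox's actual API. It shows the general idea of a trainer that owns
# a 🤗/tokenizers backend and hands back a 🤗/transformers fast tokenizer.
from pathlib import Path
from typing import List, Union

from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaTokenizerFast


class ByteLevelBPETokenizerTrainer:
    """Trains a byte-level BPE backend and exposes it as a fast tokenizer."""

    def __init__(self, vocab_size: int = 50_265, min_frequency: int = 2) -> None:
        self.vocab_size = vocab_size
        self.min_frequency = min_frequency
        # Rust-backed tokenizer implementation from 🤗/tokenizers.
        self.backend = ByteLevelBPETokenizer()

    def train(self, files: List[str]) -> None:
        # Train the backend BPE model on raw text files.
        self.backend.train(
            files=files,
            vocab_size=self.vocab_size,
            min_frequency=self.min_frequency,
            special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
        )

    def save_and_convert(self, save_dir: Union[str, Path]) -> RobertaTokenizerFast:
        # Persist vocab.json + merges.txt, then load them into a fast tokenizer.
        Path(save_dir).mkdir(parents=True, exist_ok=True)
        self.backend.save_model(str(save_dir))
        return RobertaTokenizerFast.from_pretrained(str(save_dir))
```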

From now on, if you need a custom tokenizer, you probably also need a trainer. Inherit from one of the existing classes, or create a new pair modeled on the existing tokenizers and trainers (see the usage sketch below). We're going to extend the built-in tokenizers and trainers in the next releases.
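
A hypothetical usage of the sketch above might look like this (the corpus path and output directory are assumptions for illustration):

```python
# Usage sketch, continuing the illustrative trainer above.
trainer = ByteLevelBPETokenizerTrainer(vocab_size=32_000)
trainer.train(files=["data/train.txt"])  # hypothetical corpus path
tokenizer = trainer.save_and_convert("tokenizers/my-bpe")

encoded = tokenizer("Tokenizer trainers connect both libraries.")
print(encoded.input_ids)
```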

Also note that we deprecate using slow tokenizers from the transformers library, as they don't use the fast Rust backend. Our built-in tokenizers are fast tokenizers backed by the Rust tokenization backend.
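
In practice, this means preferring the `*Fast` classes from 🤗/transformers (or our built-in tokenizers) over the pure-Python slow ones, for example:

```python
from transformers import RobertaTokenizer, RobertaTokenizerFast

# Deprecated path: the slow, pure-Python implementation.
slow_tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Preferred path: the fast implementation backed by the Rust 🤗/tokenizers library.
fast_tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
assert fast_tokenizer.is_fast
```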