Revamped flags for the tokenizer CLI. Now it supports providing BOS, EOS, and PAD tokens.
Added a deprecation warning for the older CLI flag.
Added tests for tokenizer.
Added an experimental feature to split the input into paragraphs before tokenization. This speeds up tokenizers like Llama and Mixtral significantly by working around the inefficient `Replace` normalizer in HuggingFace's `tokenizers` library. It should no longer be needed once this PR is merged in.
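
The paragraph-splitting idea can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the helper names `split_paragraphs` and `tokenize_by_paragraph` are hypothetical, splitting on runs of newlines is an assumption about what counts as a paragraph, and `encode` stands in for whatever tokenizer call is actually used. The intent is that each call hands the normalizer a shorter string.

```python
import re
from typing import Callable, List


def split_paragraphs(text: str) -> List[str]:
    """Split text into chunks that each end at a run of newlines.

    Separators stay attached to the chunk before them, so joining the
    chunks reproduces the original text exactly.
    (Hypothetical helper; the paragraph boundary rule is an assumption.)
    """
    return re.findall(r"[^\n]*\n+|[^\n]+", text)


def tokenize_by_paragraph(text: str, encode: Callable[[str], List[int]]) -> List[int]:
    """Encode each paragraph separately and concatenate the token ids.

    `encode` must not add special tokens (BOS/EOS); those should be
    added once around the full sequence, not once per chunk.
    """
    ids: List[int] = []
    for chunk in split_paragraphs(text):
        ids.extend(encode(chunk))
    return ids


if __name__ == "__main__":
    # Toy character-level "tokenizer" used only to keep the sketch runnable.
    def toy_encode(s: str) -> List[int]:
        return [ord(c) for c in s]

    sample = "First paragraph.\n\nSecond paragraph.\nThird."
    print(split_paragraphs(sample))
    print(tokenize_by_paragraph(sample, toy_encode) == toy_encode(sample))
```

With a real HuggingFace tokenizer, `encode` could be something like `lambda s: tokenizer.encode(s, add_special_tokens=False)`. Note that token ids near paragraph boundaries may differ from those produced by encoding the full text in one call, which is one reason to treat the feature as experimental.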