The script tokviz/visualization.py can and should have functionality to visualize custom and local tokenizers. Start with HF Transformers' PreTrainedTokenizerFast class for ease. This lets people visualize their trained tokenizers without having to upload them to the HF Hub and register them with AutoTokenizer.
E.g.:
from IPython.display import HTML, display
from transformers import AutoTokenizer, PreTrainedTokenizerFast

def token_visualizer(text, models=['openai-community/gpt2'], local=False):
    """
    Compares tokenization patterns across different language models and visualizes the results.

    Args:
        text (str): The input text to tokenize and compare.
        models (list): A list of model names or identifiers to compare; an entry can also be
            an absolute path to a local tokenizer directory.
            Default is ['openai-community/gpt2'].
        local (bool): Whether the entries in models point to local PreTrainedTokenizerFast
            directories rather than Hub identifiers.
    """
    for model in models:
        if not local:
            tokenizer = AutoTokenizer.from_pretrained(model)
        else:
            tokenizer = PreTrainedTokenizerFast.from_pretrained(model)
        # Both tokenizer classes expose .tokenize(), which returns token strings;
        # .encode() would return integer IDs, which the visualizer can't render.
        tokens = tokenizer.tokenize(text)
        # Continue as before
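
Usage could then look like this (a minimal sketch; the local directory path below is hypothetical, but any folder produced by tokenizer.save_pretrained(...) should work):

# Hypothetical path to a locally saved tokenizer directory
token_visualizer(
    "Tokenizers split text differently!",
    models=["/path/to/my-local-tokenizer"],
    local=True,
)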