deeepsig / tokviz

tokviz is a Python library for visualizing tokenization patterns across different language models.
https://pypi.org/project/tokviz/
MIT License

Add custom tokenizer support (see code) #2

Open DrewGalbraith opened 6 days ago

DrewGalbraith commented 6 days ago

Purpose

The script tokviz/visualization.py can and should be able to visualize custom and local tokenizers. Starting with HF Transformers' PreTrainedTokenizerFast class would be easiest. This lets people visualize their trained tokenizers without having to upload them to HF and register them with AutoModel.

E.g.:

from IPython.display import HTML, display
from transformers import AutoTokenizer, PreTrainedTokenizerFast

def token_visualizer(text, models=['openai-community/gpt2'], local=False):
    """
    Compares tokenization patterns across different language models and visualizes the results.

    Args:
        text (str): The input text to tokenize and compare.
        models (list): A list of language model names or identifiers to compare; an entry can also be
                       an absolute path to a local tokenizer directory.
                       Default is ['openai-community/gpt2'].
        local (bool): Whether the entries in `models` are local PreTrainedTokenizerFast directories
                      rather than Hugging Face Hub identifiers.
    """
    for model in models:
        if not local:
            # Hub identifier: let AutoTokenizer resolve the tokenizer class.
            tokenizer = AutoTokenizer.from_pretrained(model)
        else:
            # Local directory: load the saved fast tokenizer directly.
            tokenizer = PreTrainedTokenizerFast.from_pretrained(model)
        # tokenize() returns token strings in both branches; encode() would return integer IDs instead.
        tokens = tokenizer.tokenize(text)
        # Continue as before
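
As an alternative to an explicit `local` flag, the choice could be inferred per entry, so that Hub identifiers and local directories can be mixed in one `models` list. The following is a minimal sketch, not part of tokviz: the helper names `looks_like_local_tokenizer` and `split_sources` are hypothetical, and it assumes a saved fast tokenizer directory contains a `tokenizer.json` file.

```python
import tempfile
from pathlib import Path


def looks_like_local_tokenizer(model: str) -> bool:
    """Return True if `model` is an existing directory containing tokenizer.json."""
    path = Path(model)
    return path.is_dir() and (path / "tokenizer.json").is_file()


def split_sources(models):
    """Partition entries into (local_dirs, hub_ids), preserving their order."""
    local_dirs = [m for m in models if looks_like_local_tokenizer(m)]
    hub_ids = [m for m in models if not looks_like_local_tokenizer(m)]
    return local_dirs, hub_ids


# Demo with a throwaway directory standing in for a saved tokenizer.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tokenizer.json").write_text("{}")
    local_dirs, hub_ids = split_sources([tmp, "openai-community/gpt2"])
    print("local:", local_dirs)
    print("hub:", hub_ids)
```

Inside the loop, each entry in `local_dirs` would go to `PreTrainedTokenizerFast.from_pretrained` and each entry in `hub_ids` to `AutoTokenizer.from_pretrained`, which removes the need for callers to set the flag by hand.
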
DrewGalbraith commented 6 days ago

It actually supports the above class well enough as is. Wouldn't mind seeing this expanded to other more custom tokenizers though :) low priority.
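
One way the "more custom tokenizers" case could work is to accept any callable that maps a string to a list of token strings, rather than requiring a Transformers class. This is only a sketch under that assumption: `render_tokens`, `visualize_with`, and `PALETTE` are hypothetical names, and the colors and markup are illustrative, not what tokviz actually emits.

```python
import html
from typing import Callable, List

# Hypothetical palette; the colors tokviz uses may differ.
PALETTE = ["#ffd6a5", "#caffbf", "#9bf6ff", "#bdb2ff"]


def render_tokens(tokens: List[str]) -> str:
    """Wrap each token in a colored <span>, cycling through PALETTE."""
    spans = []
    for i, token in enumerate(tokens):
        color = PALETTE[i % len(PALETTE)]
        # Escape the token so characters like "<" display literally.
        spans.append(
            f'<span style="background-color: {color}">{html.escape(token)}</span>'
        )
    return "".join(spans)


def visualize_with(tokenize: Callable[[str], List[str]], text: str) -> str:
    """Tokenize `text` with any callable and return the HTML markup."""
    return render_tokens(tokenize(text))


# Toy "custom tokenizer": whitespace splitting stands in for a real one.
markup = visualize_with(str.split, "hello <b> world")
print(markup)
```

A Transformers tokenizer fits the same interface by passing its bound `tokenizer.tokenize` method, and in a notebook the result can be shown with `display(HTML(markup))`.
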