dqbd / tiktoken

JS port and JS/WASM bindings for openai/tiktoken
MIT License
705 stars 53 forks source link

Expose special_tokens in the API #1

Closed darknoon closed 1 year ago

darknoon commented 1 year ago

If you look at the base code, this is their example code:

cl100k_base = tiktoken.get_encoding("cl100k_base")

# In production, load the arguments directly instead of accessing private attributes
# See openai_public.py for examples of arguments for specific encodings
enc = tiktoken.Encoding(
    # If you're changing the set of special tokens, make sure to use a different name
    # It should be clear from the name what behaviour to expect.
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    }
)

We need more or less the same thing (ability to pass in custom tokens and their corresponding IDs) for our app. Currently don't care about pat_str or mergeable_ranks.