This is the example from the upstream tiktoken code:
import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")

# In production, load the arguments directly instead of accessing private attributes
# See openai_public.py for examples of arguments for specific encodings
enc = tiktoken.Encoding(
    # If you're changing the set of special tokens, make sure to use a different name
    # It should be clear from the name what behaviour to expect.
    name="cl100k_im",
    pat_str=cl100k_base._pat_str,
    mergeable_ranks=cl100k_base._mergeable_ranks,
    special_tokens={
        # Keep the base encoding's special tokens and add the two new ones
        **cl100k_base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
    },
)
We need more or less the same thing for our app: the ability to pass in custom tokens and their corresponding IDs. We currently don't care about pat_str or mergeable_ranks.
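To make the request concrete, here is a rough sketch of the kind of helper we have in mind, written against the Python example above. The names merge_special_tokens and extend_encoding are hypothetical (they are not part of tiktoken), and the collision checks are our own assumption about what a safe API should reject. The second function reuses the same private attributes as the upstream example, so it carries the same "not for production" caveat.

```python
from typing import Dict, Mapping


def merge_special_tokens(
    base: Mapping[str, int],
    custom: Mapping[str, int],
    first_free_id: int,
) -> Dict[str, int]:
    """Hypothetical helper: extend `base` special tokens with `custom` ones.

    Raises ValueError if a custom ID falls below `first_free_id` (assumed to
    be the ordinary-token range), reuses an ID already taken by another
    special token, or redefines an existing token with a different ID.
    """
    merged = dict(base)
    used_ids = set(merged.values())
    for token, token_id in custom.items():
        if token in merged and merged[token] != token_id:
            raise ValueError(f"{token!r} already maps to {merged[token]}")
        if token_id < first_free_id:
            raise ValueError(f"ID {token_id} overlaps the ordinary token range")
        if token_id in used_ids and merged.get(token) != token_id:
            raise ValueError(f"ID {token_id} is already used by another token")
        merged[token] = token_id
        used_ids.add(token_id)
    return merged


def extend_encoding(base_name: str, new_name: str, custom: Mapping[str, int]):
    """Hypothetical wrapper: build a new encoding from a base plus custom tokens."""
    import tiktoken  # imported here so the pure helper above has no dependency

    base = tiktoken.get_encoding(base_name)
    special_tokens = merge_special_tokens(
        base._special_tokens,
        custom,
        # Assumption: special-token IDs must sit above every mergeable rank.
        first_free_id=max(base._mergeable_ranks.values()) + 1,
    )
    return tiktoken.Encoding(
        name=new_name,  # use a new name, as the upstream comment advises
        pat_str=base._pat_str,
        mergeable_ranks=base._mergeable_ranks,
        special_tokens=special_tokens,
    )
```

With something like this, the upstream example would reduce to extend_encoding("cl100k_base", "cl100k_im", {"<|im_start|>": 100264, "<|im_end|>": 100265}), and the caller never has to touch pat_str or mergeable_ranks.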