IAPark / tiktoken_ruby

Unofficial ruby binding for tiktoken by way of rust
MIT License
118 stars 26 forks

[Proposal] Claude Tokenizer #15

Open BarberAlec opened 1 year ago

BarberAlec commented 1 year ago

tiktoken_ruby gem currently supports 4 encoders:

Claude appears to use tiktoken parameters outlined here and implemented here.

The BPE rankings are stored in a different format, but by reverse engineering the JavaScript tiktoken implementation here, I was able to create a tiktoken encoder for Claude in Python with the following code. Note that claude.json was sourced from the referenced JavaScript tiktoken library, which is part of the official Anthropic account.

import tiktoken
import json
import base64

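def decode_claude_bpe(claude_configs):
    """Turn the space-separated 'bpe_ranks' string into a {token_bytes: rank} dict."""
    # The first field is ignored, the second is the rank offset, the rest are base64 tokens
    _, offset, *tokens = claude_configs['bpe_ranks'].split(" ")
    offset = int(offset)

    # Ranks start at the offset (5) rather than 0 for some reason; this is what the original JS code does
    rank_map = {base64.b64decode(token): offset + idx for idx, token in enumerate(tokens)}

    return rank_map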

if __name__ == "__main__":
    with open("claude.json") as f:
        claude_configs = json.load(f)
        bpe_ranks = decode_claude_bpe(claude_configs)

    enc = tiktoken.Encoding(
        name="claude_tokenizer",
        pat_str=claude_configs['pat_str'],
        mergeable_ranks=bpe_ranks,
        special_tokens=claude_configs['special_tokens'],
    )
    print(enc.encode("hello world"))
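To make the rank format easier to see without needing claude.json or tiktoken installed, here is a minimal sketch of the same decoding step run on a toy string. The toy tokens, the "!" marker, and the name decode_bpe_ranks are made up for illustration; only the layout (one ignored field, an offset, then base64 tokens) follows the code above.

```python
import base64


def decode_bpe_ranks(bpe_ranks):
    # Assumed layout, as in the code above: "<ignored> <offset> <b64 token> <b64 token> ..."
    _, offset, *tokens = bpe_ranks.split(" ")
    offset = int(offset)
    # Each token's rank is its position in the list plus the offset
    return {base64.b64decode(tok): offset + idx for idx, tok in enumerate(tokens)}


# Hypothetical toy input, NOT taken from claude.json
toy_tokens = [b"a", b"b", b"ab"]
toy = "! 5 " + " ".join(base64.b64encode(t).decode("ascii") for t in toy_tokens)

ranks = decode_bpe_ranks(toy)
print(ranks)  # {b'a': 5, b'b': 6, b'ab': 7}
```

The resulting dict has the same shape as the mergeable_ranks argument passed to tiktoken.Encoding above.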

Alternatively, an option to create a tiktoken encoder from custom BPE ranks etc., as in the Python library, would be a more general solution.
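As a rough sketch of what feeding custom ranks into a general loader could look like, the snippet below writes a rank map as one "<base64 token> <rank>" pair per line and reads it back. As far as I know this is the same line format that the Python library's load_tiktoken_bpe reads, but treat that as an assumption; dump_ranks, load_ranks, and the toy ranks are hypothetical names and data, not part of any library.

```python
import base64


def dump_ranks(ranks):
    # One "<base64 token> <rank>" pair per line, ordered by rank
    lines = []
    for token, rank in sorted(ranks.items(), key=lambda kv: kv[1]):
        lines.append(base64.b64encode(token).decode("ascii") + " " + str(rank))
    return "\n".join(lines) + "\n"


def load_ranks(text):
    # Inverse of dump_ranks: parse each non-empty line back into {token_bytes: rank}
    ranks = {}
    for line in text.splitlines():
        if not line:
            continue
        token, rank = line.split()
        ranks[base64.b64decode(token)] = int(rank)
    return ranks


# Hypothetical toy ranks, NOT the real Claude ranks
toy_ranks = {b"a": 5, b"b": 6, b"ab": 7}
text = dump_ranks(toy_ranks)
restored = load_ranks(text)
print(restored == toy_ranks)  # True
```

With something like this, the gem would only need to accept a ranks file, a pattern string, and special tokens, and would not need to know about Claude specifically.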

IAPark commented 1 year ago

I do prefer the idea of creating a general solution. I think adding explicit Claude support moves away from the idea of a wrapper.