dqbd / tiktoken

JS port and JS/WASM bindings for openai/tiktoken
MIT License

Anthropic models #27

Closed enricoros closed 11 months ago

enricoros commented 1 year ago

Anthropic has released its models for research and has published its code on GitHub: https://github.com/anthropics/anthropic-sdk-python/blob/main/anthropic/tokenizer.py

In this repo, there's a link to a file: CLAUDE_TOKENIZER_REMOTE_FILE = "https://public-json-tokenization-0d8763e8-0d7e-441b-a1e2-1c73b8e79dc3.storage.googleapis.com/claude-v1-tokenization.json"

Can this help in extending Tiktoken to support 'claude-v1' models?

dqbd commented 1 year ago

Hi @enricoros!

Looking at the tokenizer.py file, it seems they are using huggingface/tokenizers, which already has Node.js bindings: https://www.npmjs.com/package/tokenizers.

Here is an example using huggingface/tokenizers:

const util = require("util");
const { Tokenizer } = require("tokenizers");

// Load the tokenizer definition downloaded from CLAUDE_TOKENIZER_REMOTE_FILE.
const tokenizer = Tokenizer.fromFile("claude-v1-tokenization.json");

// Wrap the callback-style methods so they can be awaited.
const encode = util.promisify(tokenizer.encode.bind(tokenizer));
const decode = util.promisify(tokenizer.decode.bind(tokenizer));

async function main() {
  const encoded = await encode("Hello from Anthropic!");
  console.log({ encoded: encoded.getIds() });

  const decoded = await decode(
    encoded.getIds(),
    true // skipSpecialTokens: true
  );

  console.log({ decoded });
}

// Report failures instead of leaving an unhandled rejection.
main().catch(console.error);

However, huggingface/tokenizers seems to have some issues supporting newer Node.js versions and/or arm64. I will look into whether there is some overlap between tiktoken and the default tokenizer.

https://github.com/huggingface/tokenizers/issues/911
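One rough way to gauge that overlap is to inspect the downloaded JSON file directly, without the native bindings. The sketch below is only an illustration: it assumes the file follows the usual huggingface/tokenizers serialization layout (`model.type`, `model.vocab`, `model.merges`, `added_tokens`), which has not been verified here for claude-v1-tokenization.json. The helper names `summarizeTokenizerConfig` and `summarizeTokenizerFile` are hypothetical and are not part of either library.

```javascript
const fs = require("fs");

// Summarize a parsed tokenizer config object.
// Assumption: the object follows the huggingface/tokenizers JSON layout,
// i.e. `model.type`, `model.vocab` (token -> id), `model.merges`,
// and a top-level `added_tokens` array of { id, content, special }.
function summarizeTokenizerConfig(config) {
  const model = config.model || {};
  const vocab = model.vocab || {};
  const merges = Array.isArray(model.merges) ? model.merges : [];
  const addedTokens = Array.isArray(config.added_tokens)
    ? config.added_tokens
    : [];

  return {
    // A "BPE" model type would suggest the data could map onto BPE ranks.
    modelType: model.type || "unknown",
    vocabSize: Object.keys(vocab).length,
    mergeCount: merges.length,
    // Content strings of the tokens flagged as special.
    specialTokens: addedTokens.filter((t) => t.special).map((t) => t.content),
  };
}

// Hypothetical convenience wrapper: read the JSON file from disk and summarize it.
function summarizeTokenizerFile(path) {
  return summarizeTokenizerConfig(JSON.parse(fs.readFileSync(path, "utf8")));
}
```

With the file saved locally, calling `summarizeTokenizerFile("claude-v1-tokenization.json")` would return the model type, vocabulary size, merge count, and special tokens. Missing fields fall back to "unknown" or zero rather than throwing, so a file with a different layout still yields a readable summary.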

darknoon commented 1 year ago

It's cool that, afaict, the core is also written in Rust. I could see this following a similar pattern to what we have with tiktoken :D

dqbd commented 1 year ago

@enricoros Some progress (with experimental JSON configs for @dqbd/tiktoken) can be seen here: https://github.com/dqbd/tiktokenizer/pull/5

Demo of Tiktokenizer playground: https://tiktokenizer-git-custom-bpe-models-dqbd.vercel.app/

enricoros commented 1 year ago

Very interesting approach, and I love the playground too. Thanks for the update! I believe this bug can be closed now, as you got it to work!