Closed enricoros closed 11 months ago
Hi @enricoros!
Looking at the tokenizer.py file, it seems they are using huggingface/tokenizers, which already has Node.js bindings: https://www.npmjs.com/package/tokenizers.
Here is some example code using huggingface/tokenizers:
const util = require("util");
const { Tokenizer } = require("tokenizers");

// Load the tokenizer definition from the downloaded JSON file.
const tokenizer = Tokenizer.fromFile("claude-v1-tokenization.json");

// encode/decode are callback-based here, so wrap them as promises.
const encode = util.promisify(tokenizer.encode.bind(tokenizer));
const decode = util.promisify(tokenizer.decode.bind(tokenizer));

async function main() {
  const encoded = await encode("Hello from Anthropic!");
  console.log({ encoded: encoded.getIds() });

  const decoded = await decode(
    encoded.getIds(),
    true // skipSpecialTokens: true
  );
  console.log({ decoded });
}

main();
However, it does seem that huggingface/tokenizers has some issues with newer Node.js versions and/or arm64. Will look into whether there is some overlap between tiktoken and the default tokenizer.
It's cool that, afaict, the core is also in Rust. I could see this following a similar pattern to what we have with tiktoken :D
@enricoros Some progress (with experimental JSON configs for @dqbd/tiktoken) can be seen here: https://github.com/dqbd/tiktokenizer/pull/5
Demo of Tiktokenizer playground: https://tiktokenizer-git-custom-bpe-models-dqbd.vercel.app/
Very interesting approach, and I love the playground too. Thanks for the update! I believe this bug can be closed now, as you got it to work!
Anthropic has released the models for research, and has opened their code on GitHub: https://github.com/anthropics/anthropic-sdk-python/blob/main/anthropic/tokenizer.py
In this repo, there's a link to a file:
CLAUDE_TOKENIZER_REMOTE_FILE = "https://public-json-tokenization-0d8763e8-0d7e-441b-a1e2-1c73b8e79dc3.storage.googleapis.com/claude-v1-tokenization.json"
Can this help in extending Tiktoken to support 'claude-v1' models?
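As a first step toward answering that, it may help to look at what the JSON file actually contains. Below is a minimal sketch, assuming the file follows the usual Hugging Face tokenizers JSON layout (a `model` object with `type`, `vocab` and `merges`, plus a top-level `added_tokens` list); that layout has not been verified against the real file here. The function name `summarizeTokenizerConfig` and the sample data are hypothetical and only stand in for the real config.

```javascript
// Sketch: summarize a Hugging Face tokenizers-style JSON config.
// Assumption: the config has `model.type`, `model.vocab` (token -> id),
// `model.merges`, and a top-level `added_tokens` array.
function summarizeTokenizerConfig(config) {
  const model = config.model ?? {};
  const vocab = model.vocab ?? {};
  const merges = model.merges ?? [];
  const added = config.added_tokens ?? [];

  // Invert the vocab so ids can be mapped back to token strings.
  const idToToken = new Map(
    Object.entries(vocab).map(([token, id]) => [id, token])
  );

  return {
    modelType: model.type ?? "unknown",
    vocabSize: idToToken.size,
    mergeCount: merges.length,
    specialTokens: added.filter((t) => t.special).map((t) => t.content),
    idToToken,
  };
}

// Tiny hand-written stand-in for the real file (hypothetical data).
const sampleConfig = {
  added_tokens: [{ id: 0, content: "<EOT>", special: true }],
  model: {
    type: "BPE",
    vocab: { "<EOT>": 0, H: 1, i: 2, Hi: 3 },
    merges: ["H i"],
  },
};

const summary = summarizeTokenizerConfig(sampleConfig);
console.log(summary.modelType, summary.vocabSize, summary.mergeCount);
```

To try it on the real file, download it from the URL above and pass `JSON.parse(fs.readFileSync("claude-v1-tokenization.json", "utf8"))` instead of `sampleConfig`. If the model type turns out to be BPE, the vocab and merges are the pieces a tiktoken-style encoder would need to be built from.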