Open vshei opened 1 year ago
You can just use js-tiktoken with the claude.json file from this repo:
import claude from './claude.json'
import { Tiktoken, TiktokenBPE } from 'js-tiktoken'

// Modified from: https://github.com/anthropics/anthropic-tokenizer-typescript
// (they use an old version of Tiktoken that isn't edge compatible)
export function countTokens(text: string): number {
  const tokenizer = getTokenizer()
  const encoded = tokenizer.encode(text.normalize('NFKC'), 'all')
  return encoded.length
}

// ----------------------
// Private APIs
// ----------------------
const getTokenizer = (): Tiktoken => {
  const ranks: TiktokenBPE = {
    bpe_ranks: claude.bpe_ranks,
    special_tokens: claude.special_tokens,
    pat_str: claude.pat_str,
  }
  return new Tiktoken(ranks)
}
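One thing to note about the snippet above: getTokenizer() builds a new Tiktoken instance on every countTokens call. If that turns out to be costly, the instance could be created once and reused. Below is a minimal sketch of that idea, not part of the original snippet. The names `Encoder`, `lazy`, and `makeCountTokens` are hypothetical, and `Encoder` is an assumed structural stand-in for the single `encode` method used above, so the example runs without js-tiktoken or claude.json.

```typescript
// Assumed stand-in for the one js-tiktoken method used above (hypothetical type).
interface Encoder {
  encode(text: string, allowedSpecial?: string[] | 'all'): number[]
}

// Wrap a factory so it runs at most once; later calls reuse the first result.
function lazy<T>(factory: () => T): () => T {
  let cached: { value: T } | undefined
  return () => {
    if (cached === undefined) cached = { value: factory() }
    return cached.value
  }
}

// Hypothetical helper: build a countTokens function around any encoder factory.
function makeCountTokens(createEncoder: () => Encoder): (text: string) => number {
  const getEncoder = lazy(createEncoder)
  return (text) => getEncoder().encode(text.normalize('NFKC'), 'all').length
}

// Demo with a fake encoder that emits one token per whitespace-separated word.
let builds = 0
const countWords = makeCountTokens(() => {
  builds += 1
  return { encode: (text) => text.split(/\s+/).filter(Boolean).map((_, i) => i) }
})
```

With the real library, the factory passed to makeCountTokens would presumably be the body of getTokenizer from the snippet above, i.e. a function returning `new Tiktoken(ranks)`.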
Is the explicit number of tokens mentioned in the claude.json correct?
Hello, the underlying package being used (https://github.com/dqbd/tiktoken) seems to run into issues in a Vercel Serverless environment. Our application is currently built on Next.js 13, and we are seeing this error in our logs:
Error: Missing tiktoken_bg.wasm
We saw this issue before when we tried using the dqbd/tiktoken library directly. We had to switch to using js-tiktoken to resolve it.
Per the README in the GitHub repo, this seems to be the difference between the two:
tiktoken (formerly hosted at @dqbd/tiktoken): WASM bindings for the original Python library, providing full 1-to-1 feature parity.
js-tiktoken: Pure JavaScript port of the original library with the core functionality, suitable for environments where WASM is not well supported or not desired (such as edge runtimes).
I was wondering whether it would be possible to build a version using the js-tiktoken library, for better portability and for folks on environments where WASM is not easy to work with. The error and the fix (i.e. the creation of js-tiktoken) can be seen here: https://github.com/transitive-bullshit/chatgpt-api/issues/570
Thanks!