anthropics / anthropic-tokenizer-typescript

MIT License
49 stars, 2 forks

Support for Vercel Serverless and Edge #6

Open vshei opened 1 year ago

vshei commented 1 year ago

Hello, the underlying package being used (https://github.com/dqbd/tiktoken) seems to run into issues in a Vercel Serverless environment. Our application is currently built on Next.js 13, and we are seeing this error in our logs: Error: Missing tiktoken_bg.wasm

We saw this issue before when we tried using the dqbd/tiktoken library directly. We had to switch to js-tiktoken to resolve it.

Per the README in the GitHub repo, this seems to be the difference between the two:

tiktoken (formerly hosted at @dqbd/tiktoken): WASM bindings for the original Python library, providing full 1-to-1 feature parity.

js-tiktoken: Pure JavaScript port of the original library with the core functionality, suitable for environments where WASM is not well supported or not desired (such as edge runtimes).

I was wondering whether it would be possible to build a version on top of js-tiktoken, for better portability and for folks in environments where WASM is not easy to work with. The error and the fix (i.e. the creation of js-tiktoken) can be seen here: https://github.com/transitive-bullshit/chatgpt-api/issues/570

Thanks!

iwasrobbed commented 1 year ago

You can just use js-tiktoken with the claude.json file from this repo:

import claude from './claude.json'
import { Tiktoken, TiktokenBPE } from 'js-tiktoken'

// Modified from: https://github.com/anthropics/anthropic-tokenizer-typescript
// (they use an old version of Tiktoken that isn't edge compatible)

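// Count the tokens in `text` using the BPE ranks from claude.json.
export function countTokens(text: string): number {
  const tokenizer = getTokenizer()
  // NFKC-normalize first; 'all' lets every special token be encoded as a special token.
  const encoded = tokenizer.encode(text.normalize('NFKC'), 'all')
  return encoded.length
}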

// ----------------------
// Private APIs
// ----------------------

const getTokenizer = (): Tiktoken => {
  const ranks: TiktokenBPE = {
    bpe_ranks: claude.bpe_ranks,
    special_tokens: claude.special_tokens,
    pat_str: claude.pat_str,
  }
  return new Tiktoken(ranks)
}
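
Note that the snippet above builds a new tokenizer on every call to countTokens, which means the rank data is parsed again each time. If that turns out to be a cost in your setup, one option is to build the tokenizer once and reuse it. The following is only a sketch of that idea: `lazy`, `Encoder`, and `countTokensCached` are hypothetical names that are not part of js-tiktoken, and a toy stand-in replaces the real tokenizer so the block runs on its own. In real use, the factory body would be `new Tiktoken(ranks)` as above.

```typescript
// Hypothetical helper (not part of js-tiktoken): build a value once, then reuse it.
function lazy<T>(factory: () => T): () => T {
  let cached: { value: T } | undefined
  return () => {
    if (cached === undefined) {
      cached = { value: factory() }
    }
    return cached.value
  }
}

// Stand-in for the real tokenizer so this sketch is self-contained.
interface Encoder {
  encode(text: string): number[]
}

// Counts how many times the factory actually runs.
let buildCount = 0

const getCachedTokenizer = lazy<Encoder>(() => {
  buildCount += 1
  // Toy encoding: one "token" per whitespace-separated word.
  // Replace this object with `new Tiktoken(ranks)` in real code.
  return {
    encode: (text) => text.split(/\s+/).filter(Boolean).map((_, i) => i),
  }
})

export function countTokensCached(text: string): number {
  return getCachedTokenizer().encode(text.normalize('NFKC')).length
}
```

With this shape, the first call pays the construction cost and later calls in the same process or warm serverless instance reuse the cached tokenizer.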

Mypathissional commented 1 year ago

Is the explicit number of tokens mentioned in the claude.json correct?
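
One way to check this yourself, sketched under some assumptions: as far as I understand js-tiktoken's rank format, `bpe_ranks` is a compact string in which each line looks like `! <offset> <token> <token> ...`, with ranks assigned consecutively from the offset. Counting those tokens and adding the number of special tokens gives a total you can compare with the explicit count field, assumed here to be named `explicit_n_vocab`. The helper names and the sample data below are made up for illustration and are not taken from claude.json.

```typescript
// Assumed shape of the relevant fields in a rank file such as claude.json.
interface RankFile {
  explicit_n_vocab?: number
  special_tokens: Record<string, number>
  bpe_ranks: string
}

// Count the tokens listed in the compact rank string.
// Each non-empty line is assumed to be "! <offset> <token> <token> ...",
// so everything after the first two fields is one token.
function countMergeableRanks(bpeRanks: string): number {
  return bpeRanks
    .split('\n')
    .filter(Boolean)
    .reduce((total, line) => total + line.split(' ').slice(2).length, 0)
}

// Total vocabulary size: regular tokens plus special tokens.
function countVocab(file: RankFile): number {
  return countMergeableRanks(file.bpe_ranks) + Object.keys(file.special_tokens).length
}

// Made-up sample: three regular tokens and one special token.
const sample: RankFile = {
  explicit_n_vocab: 4,
  special_tokens: { '<EOT>': 3 },
  bpe_ranks: '! 0 YQ== Yg==\n! 2 Yw==',
}

const matches = countVocab(sample) === sample.explicit_n_vocab
console.log(`counted ${countVocab(sample)}, declared ${sample.explicit_n_vocab}, match: ${matches}`)
```

Running the same function on the imported claude.json object would show whether the declared count agrees with what the file actually contains.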