dqbd / tiktoken

JS port and JS/WASM bindings for openai/tiktoken

NPM package is very huge #68


anzemur commented 10 months ago

I just noticed that this package is around 13 MB unpacked, and I hit my AWS Lambda package size limit. That is far too big for serverless deployment.

So my questions are:

anzemur commented 10 months ago

@dqbd Even in the JS version, there is an index file with all of the encoders inside (3 MB+), and there are also separate encoder files in the ranks directory and chunks. Why is that? The "light" version does nothing about the package size; a light version should include only the tokenizer code, not the actual encoders, which could be loaded from a CDN or from a local directory.

dqbd commented 10 months ago

Hi @anzemur! Regarding the size of the dependency: the default entrypoint does include the BPE ranks for every encoder, whereas js-tiktoken/lite and tiktoken/lite include only the core logic without the ranks.

The unpacked size reported by npm is the raw size of the package folder in node_modules, which may not reflect the size actually shipped in your project. Your bundler should be able to perform basic tree shaking to avoid bundling unused code.

Consider the following code snippet, which can be successfully deployed to Vercel on the Hobby plan with its 1 MB code size limit (as of 30/08/2023):

import { Tiktoken } from "js-tiktoken/lite";
import cl100k_base from "js-tiktoken/ranks/cl100k_base";

export const config = { runtime: "edge" };

export default async function () {
  const encoding = new Tiktoken(cl100k_base);
  const tokens = encoding.encode("hello world");
  return new Response(`${tokens}`);
}

Ideally, though, the ranks should be fetched via a CDN, as seen in the LangChain PR https://github.com/hwchase17/langchainjs/pull/1239, which drops the bundle size down to 4.5 kB (using esbuild, which is also used internally by the Vercel dev command).

dqbd commented 10 months ago

Regarding the extensions: the .js and .cjs files for ranks are there mostly for compatibility reasons, to support interop between ESM and CJS modules, while .json is offered for users who might want to fetch the BPE ranks from other CDNs such as esm.sh.
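
For illustration, a minimal sketch of the two import styles (assuming the package's conditional exports map import to the .js build and require to the .cjs build, per the interop note above):

// ESM consumers resolve to the .js build:
import cl100k_base from "js-tiktoken/ranks/cl100k_base";

// CJS consumers resolve to the .cjs build
// (shown as a comment, since a single file cannot mix both styles):
// const cl100k_base = require("js-tiktoken/ranks/cl100k_base");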

As for your initial question: you might simply be zipping the entire project together with node_modules. You may want to bundle and minify your code first, as shown in the AWS samples repo: https://github.com/aws-samples/lambda-nodejs-esbuild
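
For example, a minimal build script along those lines (a sketch only; the src/index.ts entry path and Node target are assumptions, not taken from the samples repo):

// build.mjs: bundle and minify a Lambda handler with esbuild
import { build } from "esbuild";

await build({
  entryPoints: ["src/index.ts"], // hypothetical handler path
  bundle: true, // inline dependencies so node_modules need not be zipped
  minify: true,
  platform: "node",
  target: "node18", // match your Lambda runtime
  outfile: "dist/index.js",
});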

seyfer commented 9 months ago

@anzemur there is also another package you might consider using: https://github.com/niieani/gpt-tokenizer
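
For reference, a minimal usage sketch (assuming, per that repo's README, that the default entrypoint exposes encode() for the cl100k_base encoding):

import { encode } from "gpt-tokenizer";

const tokenCount = encode("hello world").length;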

ajayvignesh01 commented 3 months ago

For anyone coming to this now, this is the new way to do it:

import { Tiktoken } from 'js-tiktoken/lite'

// Fetch the BPE ranks for cl100k_base from a CDN instead of bundling them
const getTokenModel = async () => {
  const response = await fetch('https://tiktoken.pages.dev/js/cl100k_base.json')
  return await response.json()
}

const ranks = await getTokenModel()
const tokenizer = new Tiktoken(ranks)
const tokenCount = tokenizer.encode('Hello World').length

You could also save the JSON file to your app directory and import it into the function, as sketched below. This works on the Vercel Hobby plan.
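
A minimal sketch of that local-import variant (assuming the fetched JSON was saved at the hypothetical path ./cl100k_base.json and that your bundler supports JSON imports):

import { Tiktoken } from 'js-tiktoken/lite'
import cl100k_base from './cl100k_base.json'

const tokenizer = new Tiktoken(cl100k_base)
const tokenCount = tokenizer.encode('Hello World').length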