Open anzemur opened 10 months ago
@dqbd Even in the JS version, there is an index file with all of the encoders inside (3 MB+), and there are also separate encoder files in the ranks directory and chunks. Why is that so? The "light" version does nothing about package size; the light version should include only the code for the tokenizer, not the actual encoders, which could be loaded from a CDN or from a local directory.
Hi @anzemur!
Regarding the size of the dependency: the default entrypoint does include all BPE ranks for each of the encoders, whereas `js-tiktoken/lite` and `tiktoken/lite` include only the core logic without the ranks.
The unpacked size reported by npm takes into account the raw size of the package folder found in `node_modules`, which may not represent the actual size used in your projects. Your bundler should be able to perform basic tree shaking to avoid importing unnecessary code.
Consider the following code snippet, which can be successfully deployed to Vercel on the Hobby plan with its 1 MB code size limit (as of 30/08/2023):
```ts
import { Tiktoken } from "js-tiktoken/lite";
import cl100k_base from "js-tiktoken/ranks/cl100k_base";

export const config = { runtime: "edge" };

export default async function () {
  const encoding = new Tiktoken(cl100k_base);
  const tokens = encoding.encode("hello world");
  return new Response(`${tokens}`);
}
```
Ideally though, the ranks should be fetched via a CDN, as seen in this LangChain PR: https://github.com/hwchase17/langchainjs/pull/1239, which drops the bundle size down to 4.5 kB (using esbuild, which is also used internally by the Vercel dev command).
Regarding extensions: the `.js` and `.cjs` files for ranks are mostly there for compatibility reasons, to support interop between ESM and CJS modules, while `.json` is offered for users who might want to fetch the BPE ranks from other CDNs such as esm.sh.
Judging from your initial question, you might be zipping the entire project including `node_modules`. You might want to minify your code first, as seen in this AWS samples repo: https://github.com/aws-samples/lambda-nodejs-esbuild
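As a rough sketch of that approach (the entry file path, Node target, and output path are assumptions, not taken from the AWS repo), an esbuild invocation that bundles and minifies a Lambda handler into a single file might look like:

```shell
# Bundle and minify the handler into one file for deployment.
# Only code that is actually imported (e.g. js-tiktoken/lite)
# ends up in the output, so the full ranks are left out.
npx esbuild src/index.ts --bundle --minify \
  --platform=node --target=node18 \
  --outfile=dist/index.js
```

You would then zip only `dist/` for the Lambda package instead of the whole project with `node_modules`.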
@anzemur there is also another package you might consider using: https://github.com/niieani/gpt-tokenizer
For anyone coming to this now, this is the new way to do it:
```ts
import { Tiktoken } from 'js-tiktoken/lite'

const getTokenModel = async () => {
  const response = await fetch('https://tiktoken.pages.dev/js/cl100k_base.json')
  return await response.json()
}

const rank = await getTokenModel()
const tokenizer = new Tiktoken(rank)
const tokens = tokenizer.encode('Hello World').length
```
You could also save the JSON file to your app directory and import it into the function. This works on the Vercel Hobby plan.
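One refinement worth noting: in a serverless function the ranks JSON would otherwise be re-downloaded on every cold start's first request. A minimal sketch of caching it across warm invocations (the `getRanks` helper and its URL are illustrative, not part of js-tiktoken's API):

```typescript
// Cache the fetched BPE ranks for the lifetime of the instance, so the
// ranks JSON is downloaded at most once per warm serverless container.
let cachedRanks: Promise<Record<string, unknown>> | null = null;

function getRanks(url: string): Promise<Record<string, unknown>> {
  if (!cachedRanks) {
    // Store the promise itself so concurrent requests share one fetch.
    cachedRanks = fetch(url).then((res) => {
      if (!res.ok) throw new Error(`Failed to fetch ranks: ${res.status}`);
      return res.json();
    });
  }
  return cachedRanks;
}
```

The resolved ranks object would then be passed to `new Tiktoken(...)` as in the snippet above.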
I just noticed that this package is around 13 MB when unpacked, and I reached my AWS Lambda package size limit. This is absolutely too big for serverless deployment.
So my questions are: