dqbd / tiktoken

JS port and JS/WASM bindings for openai/tiktoken
MIT License
649 stars 49 forks source link

Automatically determine the environment #43

Open louis030195 opened 1 year ago

louis030195 commented 1 year ago

Hi!

This is a wonderful lib that I use in https://github.com/different-ai/embedbase/tree/main/sdk/embedbase-js

I was wondering if it could be possible to automatically determine the environment and import accordingly (e.g. browser, nextjs, vercel edge, etc)? It would be really great and reduce verbosity. Happy to contribute

dqbd commented 1 year ago

Hello @louis030195!

I believe it could be possible through exports field in package.json, where we could specify different entrypoint for a different engine: https://runtime-keys.proposal.wintercg.org/

I do think it does necessitate an API change, exposing a magic init() function which will perform the import() of WASM files when needed (?module suffix for Vercel etc).

At the moment I'm swarmed with a lot of work, but if you want to take a knack at it, would love to help 😃

louis030195 commented 1 year ago

Interesting, btw do you know how to unit test different environment? Maybe https://edge-runtime.vercel.app/packages/vm?

ATM found this hack somewhere (probably from you) which works both in node and vercel:

export async function getEncoding(
    encoding: TiktokenEncoding,
    options?: {
        signal?: AbortSignal;
        extendedSpecialTokens?: Record<string, number>;
    }
) {
    if (!(encoding in cache)) {
        cache[encoding] =
            getFetch()(`https://tiktoken.pages.dev/js/${encoding}.json`, {
                signal: options?.signal,
            })
                .then((res) => res.json())
                .catch((e) => {
                    delete cache[encoding];
                    throw e;
                });
    }
    const enc = new Tiktoken(await cache[encoding], options?.extendedSpecialTokens);
    // // @ts-ignore
    // const registry = new FinalizationRegistry((heldValue) => {
    //     heldValue.free()
    // });
    // registry.register(enc);
    return enc
}

I'm not 100% sure where the .free() went, bit afraid of memory leaks

dqbd commented 1 year ago

Some preliminary work has been done here: https://github.com/dqbd/package-exports-edge, where different dev environments are set up and tested using vitest with puppeteer etc.

The snippet was most likely from the recent PR done on LangChain, where I've swapped the WASM bindings with a JS port instead (js-tiktoken). The rationale was the excessive bundle size and relative complexity to get started with LangchainJS: https://github.com/hwchase17/langchainjs/pull/1239. That is why no .free() is needed 😄

You might want to consider swapping the implementation as well, considering the trade-offs such as worse performance (around 5x slower, although a PR can reduce the difference to just 2.5x slower).

Regardless, I think the general goal would be to create an isomorphic library, which will load the JS version on edge and WASM in Node.