cl100k_base vocabulary export

killthebuddh4 commented 1 year ago

Thanks for the great work!

Would it make sense for this library to export the vocabulary for a particular encoding? My goal is to implement something like https://github.com/r2d4/rellm in TypeScript, and I'm not sure what the best way to access the vocabulary is.

dqbd commented 1 year ago

Hello @killthebuddh4!

Sorry for the delay. In general, it is possible to obtain the tokens by pulling the BPE ranks. Here is a PoC:

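// Importing a JSON file needs "resolveJsonModule" enabled in tsconfig.json.
import cl100k_base from "@dqbd/tiktoken/encoders/cl100k_base.json";

// Expand the compressed rank table into { base64 token -> rank }.
// On each line the first field is ignored, the second is a rank offset,
// and the i-th remaining field is a base64 token with rank offset + i.
const uncompressed = cl100k_base.bpe_ranks
  .split("\n")
  .filter(Boolean)
  .reduce<Record<string, number>>((memo, x) => {
    const [_, offsetStr, ...tokens] = x.split(" ");
    const offset = Number.parseInt(offsetStr, 10);
    tokens.forEach((token, i) => (memo[token] = offset + i));
    return memo;
  }, {});

// Invert the table into { rank -> decoded token text }.
// Note: tokens are raw byte sequences, and some are not valid UTF-8 on their
// own, so toString("utf-8") replaces those bytes with U+FFFD. Keep the Buffer
// instead of the string if exact bytes are needed.
const cache = new Map<number, string>();
for (const [token, rank] of Object.entries(uncompressed)) {
  cache.set(rank, Buffer.from(token, "base64").toString("utf-8"));
}

console.log(cache);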

I will consider exposing the "decompression" code in a later version, but this should work fine in the meantime. Let me know how it goes 😄
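For anyone who wants to try the decompression step without installing the package, here is a minimal, self-contained sketch. It assumes the same line layout the PoC above relies on (an ignored first field, a rank offset, then base64 tokens). The names `decodeBase64` and `decompressRanks` are hypothetical helpers, not part of the library, and the `sample` string is synthetic data, not real cl100k_base content. It keeps tokens as raw bytes so nothing is lost when a token is not valid UTF-8 by itself.

```typescript
// Hypothetical helper: decode a base64 string into raw bytes.
function decodeBase64(token: string): Uint8Array {
  const binary = atob(token);
  return Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
}

// Hypothetical helper: expand a compressed rank table into { rank -> bytes }.
// Assumed line layout: "<ignored> <offset> <base64 token> <base64 token> ..."
function decompressRanks(bpeRanks: string): Map<number, Uint8Array> {
  const vocab = new Map<number, Uint8Array>();
  for (const line of bpeRanks.split("\n")) {
    if (!line) continue; // skip blank lines
    const [, offsetStr, ...tokens] = line.split(" ");
    const offset = Number.parseInt(offsetStr, 10);
    tokens.forEach((token, i) => {
      vocab.set(offset + i, decodeBase64(token));
    });
  }
  return vocab;
}

// Synthetic sample in the assumed layout (not real vocabulary data).
const sample = "! 0 IQ== Ig==\n! 5 aGVsbG8=";
const vocab = decompressRanks(sample);

// Print each rank with a best-effort UTF-8 rendering of its bytes.
const decoder = new TextDecoder();
for (const [rank, bytes] of vocab) {
  console.log(rank, JSON.stringify(decoder.decode(bytes)));
}
```

With the real encoder file, the same function could be called as `decompressRanks(cl100k_base.bpe_ranks)`. For regex-style matching against token text, the string rendering is usually what you want. Keep the byte form around if you need to reassemble multi-byte characters that span several tokens.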

killthebuddh4 commented 1 year ago

This is great, thank you!

dqbd / tiktoken

cl100k_base vocabulary export #36