Hello @killthebuddh4!
Sorry for the delay. In general, it is possible to obtain the tokens by pulling them out of the BPE ranks. Here is a PoC:
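import cl100k_base from "@dqbd/tiktoken/encoders/cl100k_base.json";

// Expand the compressed ranks: each line is "<marker> <offset> <token> <token> ...",
// where the i-th token on a line has rank offset + i.
const uncompressed = cl100k_base.bpe_ranks
  .split("\n")
  .filter(Boolean)
  .reduce<Record<string, number>>((memo, x) => {
    const [_, offsetStr, ...tokens] = x.split(" ");
    const offset = Number.parseInt(offsetStr, 10);
    tokens.forEach((token, i) => (memo[token] = offset + i));
    return memo;
  }, {});

// Map each rank (token id) to its decoded text.
// Note: tokens are base64-encoded raw bytes, so a token that holds only part of a
// multi-byte UTF-8 character will not decode to a clean string here.
const cache = new Map<number, string>();
for (const [token, rank] of Object.entries(uncompressed)) {
  cache.set(rank, Buffer.from(token, "base64").toString("utf-8"));
}

console.log(cache);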
I will consider exposing the "decompression" code in a later version, but this should work fine in the meantime. Let me know how it goes 😄
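For anyone who wants to try the parsing step without installing the package, here is a minimal sketch that runs on a tiny made-up sample. It assumes each line has the shape `<marker> <offset> <base64 token> ...`, as the parsing in the PoC implies. The helper name `decompressRanks` and the sample string are hypothetical and not part of the library. Unlike the PoC, it keeps each token as raw bytes, since a single token may hold only part of a multi-byte UTF-8 character.

```typescript
// Hypothetical helper (not part of @dqbd/tiktoken): expands a compressed
// "bpe_ranks"-style string into a rank -> raw bytes map.
function decompressRanks(bpeRanks: string): Map<number, Uint8Array> {
  const vocab = new Map<number, Uint8Array>();
  for (const line of bpeRanks.split("\n")) {
    if (!line) continue; // skip empty lines
    // The first field is ignored, the second is the rank of the first token.
    const [, offsetStr, ...tokens] = line.split(" ");
    const offset = Number.parseInt(offsetStr, 10);
    tokens.forEach((token, i) => {
      // Decode base64 into a binary string, then copy it into a byte array.
      const binary = atob(token);
      const bytes = new Uint8Array(binary.length);
      for (let j = 0; j < binary.length; j++) {
        bytes[j] = binary.charCodeAt(j);
      }
      vocab.set(offset + i, bytes);
    });
  }
  return vocab;
}

// Made-up sample data: "a" and "b" at ranks 0-1, then "é" (bytes C3 A9)
// and a lone 0xC3 byte (half of a two-byte character) at ranks 2-3.
const sample = "! 0 YQ== Yg==\n! 2 w6k= ww==";
const vocab = decompressRanks(sample);

// Decoding tokens one by one is lossy for partial characters.
const decoder = new TextDecoder("utf-8");
const asText = new Map<number, string>();
vocab.forEach((bytes, rank) => asText.set(rank, decoder.decode(bytes)));

console.log(asText);
```

The lone `0xC3` byte at rank 3 decodes to the replacement character U+FFFD, which is why keeping the bytes around may be preferable if the vocabulary is used for anything beyond display.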
This is great, thank you!
Thanks for the great work!
I'm wondering whether it would make sense for this library to export the vocabulary for a particular encoding. My goal is to implement something like https://github.com/r2d4/rellm in TypeScript, and I'm not sure what the best way to access the vocabulary is.