dqbd / tiktoken

JS port and JS/WASM bindings for openai/tiktoken
MIT License

Extending special tokens on lite package #30

Closed stephenasuncionDEV closed 1 year ago

stephenasuncionDEV commented 1 year ago

How would I extend the special tokens with the lite package? Do I need to extend them and add `<|im_start|>` etc., or do I just serialize the chat messages and pass the result into the encoder?

const encoding = new Tiktoken(
    model.bpe_ranks,
    model.special_tokens,
    model.pat_str,
  );

I'm trying to extend the special tokens in the lite package for ChatML, like the following:

function getChatGPTEncoding(
  messages: { role: string; content: string; name: string }[],
  model: "gpt-3.5-turbo" | "gpt-4" | "gpt-4-32k"
) {
  const isGpt3 = model === "gpt-3.5-turbo";

  const encoder = encoding_for_model(model, {
    "<|im_start|>": 100264,
    "<|im_end|>": 100265,
    "<|im_sep|>": 100266,
  });

  const msgSep = isGpt3 ? "\n" : "";
  const roleSep = isGpt3 ? "\n" : "<|im_sep|>";

  const serialized = [
    messages
      .map(({ name, role, content }) => {
        return `<|im_start|>${name || role}${roleSep}${content}<|im_end|>`;
      })
      .join(msgSep),
    `<|im_start|>assistant${roleSep}`,
  ].join(msgSep);

  return encoder.encode(serialized, "all");
}
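For reference, the serialization step alone can be pulled out into a pure function, which makes it easy to unit-test without instantiating an encoder. This is a sketch mirroring the snippet above; `serializeChatML` and `ChatMessage` are names introduced here for illustration:

```typescript
// Pure ChatML serialization, mirroring the logic in getChatGPTEncoding above.
// This only builds the string; token counting still requires the encoder.
type ChatMessage = { role: string; content: string; name?: string };

function serializeChatML(
  messages: ChatMessage[],
  model: "gpt-3.5-turbo" | "gpt-4" | "gpt-4-32k"
): string {
  const isGpt3 = model === "gpt-3.5-turbo";
  const msgSep = isGpt3 ? "\n" : "";
  const roleSep = isGpt3 ? "\n" : "<|im_sep|>";

  return [
    messages
      .map(
        ({ name, role, content }) =>
          `<|im_start|>${name || role}${roleSep}${content}<|im_end|>`
      )
      .join(msgSep),
    `<|im_start|>assistant${roleSep}`,
  ].join(msgSep);
}
```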

Originally posted by @dqbd in https://github.com/dqbd/tiktoken/issues/23#issuecomment-1483880292

dqbd commented 1 year ago

Hello @stephenasuncionDEV!

The easiest thing would be to spread your special tokens.

const encoding = new Tiktoken(
  model.bpe_ranks,
  {
    ...model.special_tokens,
    "<|im_start|>": 100264,
    "<|im_end|>": 100265,
    "<|im_sep|>": 100266,
  },
  model.pat_str,
);
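The spread keeps every entry of the base special-tokens map and layers the ChatML additions on top. A standalone sketch of the merge, with a dummy base map standing in for `model.special_tokens` (assumed shape `Record<string, number>`):

```typescript
// Dummy base map standing in for model.special_tokens; the real map
// comes from the loaded encoding (e.g. cl100k_base).
const baseSpecialTokens: Record<string, number> = {
  "<|endoftext|>": 100257,
};

// Object spread: base entries first, ChatML additions layered on top.
const specialTokens: Record<string, number> = {
  ...baseSpecialTokens,
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
  "<|im_sep|>": 100266,
};
```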
stephenasuncionDEV commented 1 year ago

> Hello @stephenasuncionDEV!
>
> The easiest thing would be to spread your special tokens.
>
> const encoding = new Tiktoken(
>   model.bpe_ranks,
>   {
>     ...model.special_tokens,
>     "<|im_start|>": 100264,
>     "<|im_end|>": 100265,
>     "<|im_sep|>": 100266,
>   },
>   model.pat_str,
> );

I seem to be getting "The text containing a special token that is not allowed: <|im_start|>" when extending the special token object.

export const getTokenCount = async (
  messages: ProviderMessage[],
  model: ProviderModel,
) => {
  await init((imports) => WebAssembly.instantiate(wasm, imports));

  const encoding = new Tiktoken(
    cl100k_base.bpe_ranks,
    {
      ...cl100k_base.special_tokens,
      "<|im_start|>": 100264,
      "<|im_end|>": 100265,
      "<|im_sep|>": 100266,
    },
    cl100k_base.pat_str,
  );

  const isGpt3 = model === "gpt-3.5-turbo";
  const msgSep = isGpt3 ? "\n" : "";
  const roleSep = isGpt3 ? "\n" : "<|im_sep|>";

  const serialized = [
    messages
      .map(({ role, content }) => {
        return `<|im_start|>${role}${roleSep}${content}<|im_end|>`;
      })
      .join(msgSep),
    `<|im_start|>assistant${roleSep}`,
  ].join(msgSep);

  const tokens = encoding.encode(serialized);
  encoding.free();

  return tokens.length;
};
dqbd commented 1 year ago

By default, if the text passed to encode contains a special token, the encoder throws an error.

Use the following code:

const tokens = encoding.encode(serialized, "all");

or

const tokens = encoding.encode(serialized, ["<|im_start|>", "<|im_sep|>", "<|im_end|>"]);
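The second form whitelists only the listed tokens; any other special token found in the text still raises. To illustrate the check the encoder performs, here is a simplified sketch (`assertNoDisallowedSpecial` is a name introduced here; the real check lives inside the tiktoken core, not in user code):

```typescript
// Simplified sketch of tiktoken's disallowed-special-token check.
// allowed: "all" permits every special token; an array permits only those listed.
function assertNoDisallowedSpecial(
  text: string,
  specialTokens: string[],
  allowed: "all" | string[]
): void {
  if (allowed === "all") return; // every special token is permitted
  const allowedSet = new Set(allowed);
  for (const token of specialTokens) {
    if (!allowedSet.has(token) && text.includes(token)) {
      throw new Error(
        `The text contains a special token that is not allowed: ${token}`
      );
    }
  }
}
```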
stephenasuncionDEV commented 1 year ago

Solved, thank you!