dqbd / tiktoken

JS port and JS/WASM bindings for openai/tiktoken
MIT License

feat: add option to extend special tokens and to provide custom ranks #2

Closed dqbd closed 1 year ago

dqbd commented 1 year ago

This PR implements the following features:

Creating custom encoders

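import { readFileSync } from "fs";
import { Tiktoken } from "@dqbd/tiktoken";

// Arguments: contents of the BPE ranks file, special tokens map, split pattern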

const encoder = new Tiktoken(
  readFileSync("./ranks/gpt2.tiktoken").toString("utf-8"),
  { "<|endoftext|>": 50256, "<|im_start|>": 100264, "<|im_end|>": 100265 },
  "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+"
);
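For context on the first constructor argument: a `.tiktoken` ranks file, as distributed with openai/tiktoken, is plain text with one `<base64 token> <rank>` pair per line. The sketch below shows how such a file could be read into a lookup table. `parseRanks` is a hypothetical helper written for illustration only; it is not part of this library, and the constructor above takes the raw file contents directly.

```typescript
// Hypothetical helper (not part of the library): parse the text of a
// `.tiktoken` ranks file into a map from base64-encoded token to rank.
// Assumes one "<base64 token> <rank>" pair per line; blank lines are skipped.
function parseRanks(text: string): Map<string, number> {
  const ranks = new Map<string, number>();
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    const [token, rank] = trimmed.split(/\s+/);
    const value = Number(rank);
    if (token === undefined || rank === undefined || !Number.isInteger(value)) {
      throw new Error(`Malformed rank line: ${line}`);
    }
    ranks.set(token, value);
  }
  return ranks;
}

// Two sample lines in the same shape as the start of a ranks file.
const sampleRanks = parseRanks("IQ== 0\nIg== 1\n");
console.log(sampleRanks.size, sampleRanks.get("Ig=="));
```

Keeping tokens base64-encoded in the map avoids any assumption about how the bytes decode, since many BPE tokens are not valid UTF-8 on their own.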

Extending existing encoders with additional special tokens

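import { encoding_for_model } from "@dqbd/tiktoken";

// The second argument adds special tokens on top of the model's built-in ones
const encoder = encoding_for_model("gpt2", {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
});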
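Conceptually, extending an encoder means combining the model's built-in special tokens with the extra ones passed in. The following is a minimal sketch of that merge, assuming it should reject two different tokens sharing one ID. `mergeSpecialTokens` and its collision checks are hypothetical; this PR does not state whether the library validates the input this way.

```typescript
type SpecialTokens = Record<string, number>;

// Hypothetical helper (not the library's implementation): merge extra special
// tokens into a base map, rejecting conflicting IDs or redefinitions.
function mergeSpecialTokens(base: SpecialTokens, extra: SpecialTokens): SpecialTokens {
  const merged: SpecialTokens = { ...base };
  const owners = new Map<number, string>();
  for (const [token, id] of Object.entries(base)) owners.set(id, token);

  for (const [token, id] of Object.entries(extra)) {
    const owner = owners.get(id);
    if (owner !== undefined && owner !== token) {
      throw new Error(`ID ${id} is already used by ${owner}`);
    }
    if (token in merged && merged[token] !== id) {
      throw new Error(`${token} is already defined with ID ${merged[token]}`);
    }
    merged[token] = id;
    owners.set(id, token);
  }
  return merged;
}

// The gpt2 end-of-text token plus the two extra tokens from the example above.
const mergedTokens = mergeSpecialTokens(
  { "<|endoftext|>": 50256 },
  { "<|im_start|>": 100264, "<|im_end|>": 100265 }
);
console.log(Object.keys(mergedTokens).length);
```

Failing early on a collision is a design choice for this sketch: silently overwriting an ID would make two distinct special tokens indistinguishable after encoding.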

Closes #1