dqbd / tiktoken

JS port and JS/WASM bindings for openai/tiktoken
MIT License

decode issue #24

Closed loretoparisi closed 1 year ago

loretoparisi commented 1 year ago

Hello, I ran the following:

const { encoding_for_model } = require("@dqbd/tiktoken");

const enc = encoding_for_model("gpt-3.5-turbo", {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
  "<|im_sep|>": 100266,
});
console.log(
  enc.encode("hello world"),
  enc.decode(enc.encode("hello world"))
);

but I get

Uint32Array(2) [ 15339, 1917 ] 
Uint8Array(11) [
  104, 101, 108, 108,
  111,  32, 119, 111,
  114, 108, 100
]
dqbd commented 1 year ago

Hello, enc.decode returns the raw UTF-8 bytes as a Uint8Array rather than a string, so you need to pass its output to TextDecoder.decode():

const { encoding_for_model } = require("@dqbd/tiktoken");
const textDecoder = new TextDecoder();
const enc = encoding_for_model("gpt-3.5-turbo", {
  "<|im_start|>": 100264,
  "<|im_end|>": 100265,
  "<|im_sep|>": 100266,
});
console.log(
  enc.encode("hello world"),
  textDecoder.decode(enc.decode(enc.encode("hello world")))
);
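
A likely reason for handing back bytes rather than a string is that a token boundary does not have to line up with a character boundary, so a single chunk of bytes may end in the middle of a multi-byte UTF-8 character. The sketch below does not use tiktoken at all: the two byte chunks are hypothetical stand-ins for what separate enc.decode calls might return, and the split point is chosen by hand for illustration. It shows that decoding each chunk on its own produces replacement characters, while one TextDecoder used with the standard { stream: true } option joins them correctly.

```javascript
// Standalone sketch using only the standard TextEncoder/TextDecoder APIs.
// The chunks are hypothetical stand-ins for separate enc.decode() results.
const bytes = new TextEncoder().encode("héllo"); // "é" is two bytes: 0xC3 0xA9

// Split by hand in the middle of "é" to simulate an unlucky boundary.
const chunkA = bytes.slice(0, 2); // [0x68, 0xC3]
const chunkB = bytes.slice(2);    // [0xA9, 0x6C, 0x6C, 0x6F]

// Decoding each chunk independently: the incomplete sequence in each
// chunk is replaced with U+FFFD.
const naive =
  new TextDecoder().decode(chunkA) + new TextDecoder().decode(chunkB);

// Reusing one decoder with { stream: true } buffers the dangling byte
// until the rest of the character arrives.
const streaming = new TextDecoder();
const streamed =
  streaming.decode(chunkA, { stream: true }) +
  streaming.decode(chunkB, { stream: true }) +
  streaming.decode(); // final call flushes any buffered bytes

console.log(JSON.stringify(naive));    // contains two U+FFFD characters
console.log(JSON.stringify(streamed)); // "héllo"
```

When all tokens are decoded in a single enc.decode call, as in the snippet above, a plain textDecoder.decode(...) is enough; the streaming form only matters if the bytes are decoded piece by piece.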
loretoparisi commented 1 year ago

Thanks, it works using TextDecoder.