I was looking through the OpenAI code and noticed that the wrong tokenizer is used for newer models like text-embedding-ada-002, which use the cl100k_base encoding implemented by tiktoken.
There is a list of encodings here for their public models.
I'm currently looking at making a WASM build of tiktoken, though I think a pure JS approach would also work fine.
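
To make the mismatch concrete, here is a minimal sketch of a per-model encoding lookup. The table and the helper name `encoding_name_for_model_sketch` are hypothetical and only for illustration; the three entries are my reading of tiktoken's published model list, so please check them against that list before relying on them.

```python
# Hypothetical lookup table (an assumption for illustration, not this project's code).
# Entries follow tiktoken's public model-to-encoding list as I understand it.
MODEL_TO_ENCODING = {
    "text-embedding-ada-002": "cl100k_base",  # newer embedding model
    "text-davinci-003": "p50k_base",
    "davinci": "r50k_base",                   # older GPT-3 models
}


def encoding_name_for_model_sketch(model: str) -> str:
    """Return the encoding name for a model, or raise if the model is unknown."""
    try:
        return MODEL_TO_ENCODING[model]
    except KeyError:
        # Fail loudly rather than silently falling back to an older tokenizer.
        raise KeyError(f"No known encoding for model {model!r}") from None


print(encoding_name_for_model_sketch("text-embedding-ada-002"))
```

Raising on an unknown model is a deliberate choice in this sketch: a silent fallback to the older encoding is the kind of behavior that produces wrong token counts without anyone noticing.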