Wrong tokenizer used for OpenAI embeddings

cfortuner / promptable

Build LLM apps in Typescript/Javascript. 🧑‍💻 🧑‍💻 🧑‍💻 🚀 🚀 🚀

https://docs-promptable.vercel.app

MIT License

1.77k stars 120 forks

Wrong tokenizer used for OpenAI embeddings #31

Open darknoon opened 1 year ago

darknoon commented 1 year ago

I was looking through the OpenAI code and noticed that the wrong tokenizer is used for newer models like text-embedding-ada-002, which use the cl100k_base encoding implemented by tiktoken.

There is a list of encodings here for their public models.

I'm currently looking at making a WASM build of tiktoken, though I think a pure JS approach would also work fine.
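
To make the mismatch concrete, here is a minimal sketch of choosing an encoding per model instead of using one tokenizer for everything. The names `MODEL_TO_ENCODING` and `encodingForModel` are hypothetical and are not part of promptable. The table is a partial sample of tiktoken's model-to-encoding list and would need to be extended from there.

```typescript
// Encoding names as published by tiktoken.
type EncodingName = "r50k_base" | "p50k_base" | "cl100k_base";

// Partial, illustrative mapping (assumption: extend it from tiktoken's model list).
const MODEL_TO_ENCODING = new Map<string, EncodingName>([
  ["text-embedding-ada-002", "cl100k_base"],
  ["gpt-3.5-turbo", "cl100k_base"],
  ["gpt-4", "cl100k_base"],
  ["text-davinci-003", "p50k_base"],
  ["text-davinci-002", "p50k_base"],
  ["davinci", "r50k_base"],
]);

// Hypothetical helper: look up the encoding for a model name.
// It throws for unknown models rather than silently falling back to an
// older encoding, since a silent fallback is what produces wrong token counts.
function encodingForModel(model: string): EncodingName {
  const encoding = MODEL_TO_ENCODING.get(model);
  if (encoding === undefined) {
    throw new Error(`No known encoding for model "${model}"`);
  }
  return encoding;
}
```

The returned name can then be handed to whichever tokenizer implementation ends up being used, whether a WASM build or a pure JS port.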

cfortuner commented 1 year ago

This might work -> https://www.npmjs.com/package/@dqbd/tiktoken @darknoon

Let me know