genesis-ai-dev / codex-editor

Codex Scripture Editor and Translator's Copilot
https://codex-editor.gitbook.io/
MIT License

Faster Tokenizer Alternatives #32

Closed dadukhankevin closed 4 months ago

dadukhankevin commented 7 months ago

Most tokenizers train very slowly, even on moderate amounts of text. I've started building a tokenizer that uses a genetic algorithm to reach similar results faster.

The general idea is like this:

The best individuals from every iteration are added to the tokenizer.
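
A minimal sketch of how such a loop might look. Only the step above (the best individuals of each iteration join the tokenizer) comes from the description; everything else is an assumption for illustration, not the actual implementation: individuals are substrings sampled at random from the corpus, fitness is frequency times length, and mutation grows or shrinks a substring by one character. The names `genetic_tokenize`, `fitness`, `random_span`, and `mutate` are hypothetical.

```python
import random


def fitness(token: str, text: str) -> int:
    # Assumed scoring: occurrences in the corpus weighted by token length.
    return text.count(token) * len(token)


def random_span(text: str, max_len: int, rng: random.Random) -> str:
    # Sample a random substring of 2..max_len characters from the corpus.
    length = rng.randint(2, max_len)
    start = rng.randrange(0, len(text) - length + 1)
    return text[start:start + length]


def mutate(token: str, text: str, max_len: int, rng: random.Random) -> str:
    # Assumed mutation: drop the last character, or extend the token by the
    # character that follows its first occurrence in the corpus.
    if len(token) > 2 and rng.random() < 0.5:
        return token[:-1]
    pos = text.find(token)
    end = pos + len(token)
    if pos != -1 and end < len(text) and len(token) < max_len:
        return text[pos:end + 1]
    return token


def genetic_tokenize(text, iterations=50, population=40, keep=5, max_len=8, seed=0):
    if len(text) < 2:
        raise ValueError("text must contain at least 2 characters")
    max_len = min(max_len, len(text))
    rng = random.Random(seed)
    vocab = set()
    pop = [random_span(text, max_len, rng) for _ in range(population)]
    for _ in range(iterations):
        # Rank by fitness (descending); ties are broken alphabetically.
        ranked = sorted(set(pop), key=lambda t: (-fitness(t, text), t))
        best = ranked[:keep]
        vocab.update(best)  # best individuals of this iteration join the tokenizer
        # Next generation: mutated survivors plus fresh random spans.
        children = [mutate(t, text, max_len, rng) for t in best]
        fresh = [random_span(text, max_len, rng)
                 for _ in range(population - len(children))]
        pop = children + fresh
    return vocab


if __name__ == "__main__":
    sample = "in the beginning god created the heaven and the earth " * 20
    print(sorted(genetic_tokenize(sample, iterations=20)))
```

Each iteration only scores a small population instead of counting every pair in the corpus, which is where a speedup over merge-based training would come from under these assumptions.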

This approach is much faster and can produce several thousand tokens from the Bible in about 10 seconds. I still need to do more testing to make sure the tokens are actually useful, but they generally overlap a lot with the tokens generated by other tokenizers.
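
One rough way to put a number on that overlap is to compare the two vocabularies as sets. This is only a sketch: `vocab_overlap` is a hypothetical helper, and the example vocabularies are made up for illustration, not measured results.

```python
def vocab_overlap(candidate, reference):
    """Return (share of candidate tokens found in reference, Jaccard similarity)."""
    a, b = set(candidate), set(reference)
    if not a or not b:
        return 0.0, 0.0
    shared = a & b
    return len(shared) / len(a), len(shared) / len(a | b)


if __name__ == "__main__":
    # Made-up vocabularies, purely for illustration.
    genetic = {"the", "and", "ing", "th"}
    baseline = {"the", "and", "ed", "th", "of"}
    print(vocab_overlap(genetic, baseline))
```

Overlap alone does not show that the tokens are useful, so this would complement rather than replace downstream testing.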

Most tokenizers seem to be:

I'm integrating this into the main extension right now, but I got sidetracked by errors that occur when multiple vector databases are open at the same time. Hopefully, I'll be able to push before Wednesday. This idea still has a long way to go, but I think it might be helpful. We'll see.