Most tokenizers train very slowly on even moderate amounts of text. I've started building a tokenizer that employs genetic algorithms to achieve the same results but faster:
The general idea is like this:
Split text into large chunks.
Pick many random sub-portions of this text, between 1 and 8 characters long, represent these portions as integer ranges, and place them as individuals in a genetic algorithm.
Mutate the ranges by shifting them left or right, shrinking them, or stretching them (sketched below, after the fitness function's explanation).
Assign a "fitness" score to each using this fitness function:
def fitness(self, individual: RangeToken):
    token = individual.token
    if token in self.fitness_results:
        return self.fitness_results[token]
    source_text = individual.source
    count = source_text.count(token)
    percent = count / individual.length
    score = (len(token) + len(self.tokenize(token))) * percent  # the less tokenized the text, the higher the fitness
    self.fitness_results.update({token: score})
    return score
Essentially, this takes the randomly generated token and returns its fitness if it is already known; if not, it determines the fitness based on:
its length
how often it appears in the text
and how "tokenizable" it already is. This means that the more tokens it takes to tokenize the subtext, the more the algorithm will prioritize this individual as it represents something very new to the tokenizer, making it more important to learn.
Crossover happens by combining two random tokens, finding the combined string in the source text, and converting the match back into a new range.
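A sketch of that crossover step, under the same assumptions as above (I am also assuming that children whose combined string never occurs in the source are simply discarded):

def crossover(a: RangeToken, b: RangeToken) -> RangeToken | None:
    # Concatenate the parents' tokens, then look the combined string
    # up in the source text; a match becomes the child's new range.
    combined = a.token + b.token
    pos = a.source.find(combined)
    if pos == -1:
        return None  # the combination never occurs in the source
    return RangeToken(a.source, pos, pos + len(combined))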
The best individuals from every iteration are added to the tokenizer.
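Putting the pieces together, one plausible shape for the outer loop, building on the sketches above. The population size, survivor counts, and the vocab attribute are all made up here; only the keep-the-best-each-generation behavior comes from the description above.

def evolve(self, source: str, generations: int = 50,
           pop_size: int = 200, keep: int = 10) -> None:
    # Random initial population of 1-8 character ranges.
    def random_individual() -> RangeToken:
        start = random.randrange(len(source))
        return RangeToken(source, start,
                          min(len(source), start + random.randint(1, 8)))

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=self.fitness, reverse=True)
        # Promote this generation's best tokens into the tokenizer.
        self.vocab.update(ind.token for ind in population[:keep])
        # The fitter half survives and breeds; mutation fills in
        # whenever crossover finds no match in the source.
        parents = population[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            children.append(crossover(a, b) or mutate(a))
        population = parents + children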
This approach is much faster and can produce several thousand tokens from the Bible in about 10 seconds. I still need to do more testing to make sure the tokens are actually useful, but the output generally has a lot of overlap with the tokens generated by other tokenizers, and the genetic tokenizer is much faster.
Most tokenizers seem to be brute-force algorithms that try to incrementally find the next most common "token". The genetic tokenizer probably works well because common tokens are, by the very nature of being common, more likely to be randomly picked.
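For contrast, the brute-force approach roughly amounts to a full scan per learned token, something like the illustrative sketch below. This is not any particular library's algorithm; BPE, for example, merges the most common adjacent pair rather than counting all substrings, but the cost of a pass per learned token is the shared bottleneck.

from collections import Counter

def next_most_common_token(corpus: str, vocab: set[str], max_len: int = 8) -> str:
    # Count every substring up to max_len that isn't already known,
    # then take the winner: one full O(len(corpus) * max_len) pass
    # for every single token added to the vocabulary.
    counts = Counter()
    for i in range(len(corpus)):
        for j in range(i + 1, min(i + max_len, len(corpus)) + 1):
            sub = corpus[i:j]
            if sub not in vocab:
                counts[sub] += 1
    return counts.most_common(1)[0][0]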
I'm getting this integrated into the main extension right now, but I got sidetracked by errors that show up when multiple vector databases are open at the same time. Hopefully I'll be able to push before Wednesday. This idea still has a long way to go, but I think it might be helpful; we'll see.