Most tokenizers train very slowly on even moderate amounts of text. I've started building a tokenizer that employs genetic algorithms to achieve the same results but faster:
The general idea is like this:
Split text into large chunks.
Pick many random sub-portions of this text, between 1 and 8 characters long, represent these portions as integer ranges, and place them as individuals in a genetic algorithm.
Mutate the ranges by shifting them left or right, shrinking them, or stretching them (sketched below, after the fitness function's explanation).
Assign a "fitness" score to each using this fitness function:
def fitness(self, individual: RangeToken):
    token = individual.token
    if token in self.fitness_results:
        return self.fitness_results[token]
    source_text = individual.source
    count = source_text.count(token)
    percent = count / individual.length
    score = (len(token) + len(self.tokenize(token))) * percent  # the less tokenized the text, the higher the fitness
    self.fitness_results.update({token: score})
    return score
Essentially, this takes the randomly generated token and returns its fitness if it is already known; if not, it determines the fitness based on:
its length
how often it appears in the text
and how "tokenizable" it already is. This means that the more tokens it takes to tokenize the subtext, the more the algorithm will prioritize this individual as it represents something very new to the tokenizer, making it more important to learn.
Crossover happens by combining two random tokens, finding the combined string in the source text, and converting the match back into a new range.
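A sketch of that crossover step, under the same assumptions as above (I am also assuming that children whose combined string never occurs in the source are simply discarded):

def crossover(a: RangeToken, b: RangeToken) -> RangeToken | None:
    # Concatenate the parents' tokens, then look the combined string
    # up in the source text; a match becomes the child's new range.
    combined = a.token + b.token
    pos = a.source.find(combined)
    if pos == -1:
        return None  # the combination never occurs in the source
    return RangeToken(a.source, pos, pos + len(combined))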
The best individuals from every iteration are added to the tokenizer.
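Putting the pieces together, one plausible shape for the outer loop, building on the sketches above. The population size, survivor counts, and the vocab attribute are all made up here; only the keep-the-best-each-generation behavior comes from the description above.

def evolve(self, source: str, generations: int = 50,
           pop_size: int = 200, keep: int = 10) -> None:
    # Random initial population of 1-8 character ranges.
    def random_individual() -> RangeToken:
        start = random.randrange(len(source))
        return RangeToken(source, start,
                          min(len(source), start + random.randint(1, 8)))

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=self.fitness, reverse=True)
        # Promote this generation's best tokens into the tokenizer.
        self.vocab.update(ind.token for ind in population[:keep])
        # The fitter half survives and breeds; mutation fills in
        # whenever crossover finds no match in the source.
        parents = population[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            children.append(crossover(a, b) or mutate(a))
        population = parents + children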
This approach is much faster and can produce several thousand tokens from the Bible in about 10 seconds. I still need to do more testing to make sure the tokens are actually useful, but the output generally has a lot of overlap with the tokens generated by other tokenizers, and the genetic tokenizer is much faster.
Most tokenizers seem to be brute-force algorithms that try to incrementally find the next most common "token". The genetic tokenizer probably works well because common tokens are, by the very nature of being common, more likely to be randomly picked.
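For contrast, the brute-force approach roughly amounts to a full scan per learned token, something like the illustrative sketch below. This is not any particular library's algorithm; BPE, for example, merges the most common adjacent pair rather than counting all substrings, but the cost of a pass per learned token is the shared bottleneck.

from collections import Counter

def next_most_common_token(corpus: str, vocab: set[str], max_len: int = 8) -> str:
    # Count every substring up to max_len that isn't already known,
    # then take the winner: one full O(len(corpus) * max_len) pass
    # for every single token added to the vocabulary.
    counts = Counter()
    for i in range(len(corpus)):
        for j in range(i + 1, min(i + max_len, len(corpus)) + 1):
            sub = corpus[i:j]
            if sub not in vocab:
                counts[sub] += 1
    return counts.most_common(1)[0][0]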
I'm getting this integrated into the main extension right now, but I got sidetracked by errors that show up when multiple vector databases are open at the same time. Hopefully I'll be able to push before Wednesday. This idea still has a long way to go, but I think it might be helpful; we'll see.