I need to fix this either by using this web site, by making an API call, or by approximation. One idea:
To approximate the tokenization algorithm without knowing the details of the original algorithm, we can use heuristic and iterative approaches. Here's a concise plan for the GitHub documentation:
1. Corpus Analysis: Analyze a large text corpus to identify the most frequent words and subwords.
2. Rule Definition: Create rules for tokenizing words into subwords based on frequency. For example, common words remain whole, while less common words are split into smaller pieces.
3. Tokenization Heuristics: Apply heuristics such as:
   - Tokenize punctuation and special characters first.
   - Split compound words and those with common prefixes/suffixes.
   - Use a dictionary of frequent words to tokenize whole words when possible.
4. Iterative Improvement: Implement a feedback loop where you test the algorithm on known texts and adjust the rules to better match the actual token count (see the sketch below).
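Here is a minimal sketch of what steps 2-4 could look like in Python. The tiny `FREQUENT_WORDS` set and the fixed 4-character chunk size are placeholder assumptions standing in for real corpus statistics; both would be tuned in the feedback loop.

```python
import re

# Hypothetical mini-dictionary of frequent words that stay whole; a real
# implementation would build this from the corpus analysis in step 1.
FREQUENT_WORDS = {"the", "is", "and", "of", "to", "in", "a", "that", "it", "this"}

# Rough subword size for rare words (~4 characters per token, see the
# rule of thumb noted further down).
CHUNK_SIZE = 4

def approximate_tokens(text: str) -> list[str]:
    """Heuristic tokenizer: punctuation split off first, frequent words kept
    whole, everything else split into fixed-size character chunks."""
    tokens: list[str] = []
    # \w+ grabs word-like runs, [^\w\s] peels off punctuation/special characters.
    for piece in re.findall(r"\w+|[^\w\s]", text):
        if not piece.isalnum() or piece.lower() in FREQUENT_WORDS:
            tokens.append(piece)
        else:
            # Less common word: split into ~CHUNK_SIZE character subwords.
            tokens.extend(piece[i:i + CHUNK_SIZE]
                          for i in range(0, len(piece), CHUNK_SIZE))
    return tokens

def approximate_token_count(text: str) -> int:
    return len(approximate_tokens(text))

# Feedback loop from step 4: compare against the count reported by the
# tokenizer page or the API and adjust FREQUENT_WORDS / CHUNK_SIZE.
print(approximate_token_count("Tokenization is the process of splitting text into tokens."))
```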
I saw that 1 token is not equal to 1 word but roughly 3 or 4 characters:
https://platform.openai.com/tokenizer
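Based on that rule of thumb, the quickest approximation is simply the character count divided by four. The divisor 4 is just the average suggested by the tokenizer page, not an exact value:

```python
def rough_token_estimate(text: str) -> int:
    # Rule of thumb: ~4 characters per token on average for English text.
    # Real counts depend on the actual BPE vocabulary, so treat this as an estimate.
    return max(1, round(len(text) / 4))

print(rough_token_estimate("I need to fix this by using this web site or by doing an API call."))
```

If adding a dependency is an option, OpenAI's tiktoken library gives the exact count locally, without calling the web site or the API, e.g. `len(tiktoken.get_encoding("cl100k_base").encode(text))`.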