I need to fix this either by using this web site, by making an API call, or by approximation. One idea:
To approximate the tokenization algorithm without knowing the details of the original algorithm, we can use heuristic and iterative approaches. Here's a concise plan for the GitHub documentation:
1. Corpus Analysis: Analyze a large text corpus to identify the most frequent words and subwords.
2. Rule Definition: Create rules for tokenizing words into subwords based on frequency. For example, common words remain whole, while less common words are split into smaller pieces.
3. Tokenization Heuristics: Apply heuristics such as:
   - Tokenize punctuation and special characters first.
   - Split compound words and those with common prefixes/suffixes.
   - Use a dictionary of frequent words to tokenize whole words when possible.
4. Iterative Improvement: Implement a feedback loop where you test the algorithm on known texts and adjust the rules to better match the actual token count (see the sketch below).
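Here is a minimal sketch of what steps 2-4 could look like in Python. The tiny `FREQUENT_WORDS` set and the fixed 4-character chunk size are placeholder assumptions standing in for real corpus statistics; both would be tuned in the feedback loop.

```python
import re

# Hypothetical mini-dictionary of frequent words that stay whole; a real
# implementation would build this from the corpus analysis in step 1.
FREQUENT_WORDS = {"the", "is", "and", "of", "to", "in", "a", "that", "it", "this"}

# Rough subword size for rare words (~4 characters per token, see the
# rule of thumb noted further down).
CHUNK_SIZE = 4

def approximate_tokens(text: str) -> list[str]:
    """Heuristic tokenizer: punctuation split off first, frequent words kept
    whole, everything else split into fixed-size character chunks."""
    tokens: list[str] = []
    # \w+ grabs word-like runs, [^\w\s] peels off punctuation/special characters.
    for piece in re.findall(r"\w+|[^\w\s]", text):
        if not piece.isalnum() or piece.lower() in FREQUENT_WORDS:
            tokens.append(piece)
        else:
            # Less common word: split into ~CHUNK_SIZE character subwords.
            tokens.extend(piece[i:i + CHUNK_SIZE]
                          for i in range(0, len(piece), CHUNK_SIZE))
    return tokens

def approximate_token_count(text: str) -> int:
    return len(approximate_tokens(text))

# Feedback loop from step 4: compare against the count reported by the
# tokenizer page or the API and adjust FREQUENT_WORDS / CHUNK_SIZE.
print(approximate_token_count("Tokenization is the process of splitting text into tokens."))
```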
I saw that 1 token is not equal to 1 word but roughly 3 or 4 characters:
https://platform.openai.com/tokenizer
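Based on that rule of thumb, the quickest approximation is simply the character count divided by four. The divisor 4 is just the average suggested by the tokenizer page, not an exact value:

```python
def rough_token_estimate(text: str) -> int:
    # Rule of thumb: ~4 characters per token on average for English text.
    # Real counts depend on the actual BPE vocabulary, so treat this as an estimate.
    return max(1, round(len(text) / 4))

print(rough_token_estimate("I need to fix this by using this web site or by doing an API call."))
```

If adding a dependency is an option, OpenAI's tiktoken library gives the exact count locally, without calling the web site or the API, e.g. `len(tiktoken.get_encoding("cl100k_base").encode(text))`.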