VendenIX / tokenCounterChatGPT

Google Chrome & Firefox extension that shows the number of tokens in your current prompt on ChatGPT
MIT License

Not a good approximation of the amount of tokens #3

Open VendenIX opened 9 months ago

VendenIX commented 9 months ago

I saw that 1 token is not equal to 1 word, but rather about 3 or 4 characters.

https://platform.openai.com/tokenizer

I need to fix this by using this website, by making an API call, or by approximation. One idea:
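As a quick interim fix, the character-based rule of thumb above (~4 characters per token for typical English text) can replace the word count directly. A minimal sketch, assuming the extension already has the prompt text as a string (the function name is my own, not from the repo):

```javascript
// Rough token estimate using OpenAI's rule of thumb of
// roughly 4 characters per token for English text.
// This is only an approximation; the real BPE tokenizer differs.
function estimateTokens(text) {
  if (!text) return 0;
  return Math.ceil(text.length / 4);
}
```

This over- or under-counts on code, non-English text, and heavy punctuation, but it is already much closer than counting words.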

To approximate the tokenization algorithm without the specific details of the original algorithm, we can use heuristic and iterative approaches. Here’s a concise plan for GitHub documentation:

  1. Corpus Analysis: Analyze a large text corpus to identify the most frequent words and subwords.
  2. Rule Definition: Create rules for tokenizing words into subwords based on frequency. For example, common words remain whole, while less common words are split into smaller pieces.
  3. Tokenization Heuristics: Apply heuristics such as:
    • Tokenize punctuation and special characters first.
    • Split compound words and those with common prefixes/suffixes.
    • Use a dictionary of frequent words to tokenize whole words when possible.
  4. Iterative Improvement: Implement a feedback loop where you test the algorithm with known texts and adjust the rules to better match the actual token count.
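Steps 1–3 above could be sketched as follows. The frequent-word set and suffix list here are illustrative placeholders, not the real BPE vocabulary; in practice they would come from the corpus analysis in step 1 and be tuned via the feedback loop in step 4:

```javascript
// Hypothetical heuristic token counter following steps 1-3.
// FREQUENT_WORDS and COMMON_SUFFIXES are toy examples, not real BPE data.
const FREQUENT_WORDS = new Set(["the", "and", "is", "to", "of", "a"]);
const COMMON_SUFFIXES = ["ing", "ed", "ly", "tion"];

function countTokensHeuristic(text) {
  // Tokenize punctuation and special characters first (step 3, bullet 1):
  // each non-alphanumeric symbol counts as its own token.
  const pieces = text.toLowerCase().match(/[a-z0-9]+|[^\sa-z0-9]/g) || [];
  let tokens = 0;
  for (const piece of pieces) {
    if (!/[a-z0-9]/.test(piece)) { tokens += 1; continue; } // punctuation
    if (FREQUENT_WORDS.has(piece)) { tokens += 1; continue; } // whole word (step 3, bullet 3)
    // Split off a common suffix (step 3, bullet 2), then count the
    // remaining stem in chunks of ~4 characters.
    let stem = piece;
    let suffixTokens = 0;
    for (const suf of COMMON_SUFFIXES) {
      if (stem.length > suf.length + 2 && stem.endsWith(suf)) {
        suffixTokens = 1;
        stem = stem.slice(0, -suf.length);
        break;
      }
    }
    tokens += Math.max(1, Math.ceil(stem.length / 4)) + suffixTokens;
  }
  return tokens;
}
```

For step 4, one could compare this function's output against the counts from https://platform.openai.com/tokenizer on a set of sample prompts and adjust the word set, suffix list, and chunk size until the error is acceptable.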