alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
MIT License

Spacecode: extend Capcode idea to composite words #10

Closed kosiakk closed 1 year ago

kosiakk commented 1 year ago

I have a suggestion to discuss that could enhance the tokenization in your already amazing approach. I wonder if there's a benefit to consistently pushing all spaces to the front of tokens (similar to what OpenAI does), to the end, or following some other strategy.

Currently, I don't see a specific strategy in english-100256-capcode. The patterns seem to stem from the statistical properties of the corpus:

| Format | Count of tokens (after trim) | Count of tokens (unmodified) |
|---|---|---|
| word | 47380 | 47125 |
| word+space | 45516 | 48173 |
| space+word | 1447 | 2652 |
| space+word+space | 1447 | 2050 |
| other | 4466 | 256 |

The difference between the two columns is subtle and shows up with multi-space tokens.

There is a noticeable overlap between formats. We can also count how many forms each trimmed word has:

| Forms | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Words | 68897 | 11286 | 471 | 727 |

68897 (69%) of all tokens are alright. They might have spaces today, but at least there is exactly one version. If we address this, we can save 2 × 11286 + 3 × 471 + 4 × 727 = 26893 ≈ 27% of token ids and reuse them for something else. I also believe it might help you with performance, because some tokens will be combined early in the process.
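Roughly, the counting above can be reproduced like this (a sketch; it assumes the vocabulary has been exported as a plain list of decoded token strings, which is not shown here):

```python
from collections import defaultdict

def count_forms(tokens):
    # Group word-like tokens by their space-trimmed form.
    forms = defaultdict(set)
    for tok in tokens:
        trimmed = tok.strip()
        if trimmed and trimmed.isalpha():
            forms[trimmed].add(tok)
    # Histogram: number of surface forms -> number of distinct words.
    histogram = defaultdict(int)
    for variants in forms.values():
        histogram[len(variants)] += 1
    # Token ids spent on words that exist in more than one form,
    # i.e. the 2 × 11286 + 3 × 471 + 4 × 727 figure above.
    duplicated = sum(len(v) for v in forms.values() if len(v) > 1)
    return dict(histogram), duplicated

# Toy example: "greedy" appears in three spellings, "and" in two.
print(count_forms([" greedy", "greedy ", "greedy", " and ", "and ", "!"]))
# ({3: 1, 2: 1}, 5)
```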

Extending Capcode to Composite Words

Capcode does a great job at treating Greedy• and greedy• as the same token (40734). However, issues can arise when considering alternate spellings, such as •greedy. By extending the Capcode idea to composite words, we could address these concerns.

What about •greedy? If such a token could appear, it would introduce a near-duplicate. Currently there are no alternative spellings, so greedy+space is the only token. Can it be used at the end of a sentence, as in greedy.? Maybe that is exactly the use case you foresee for the delete token?

•and• is token 18564, and• is 13207; there is no •and.

Proposal

All text tokens should have no spaces at the start or end. Punctuation tokens would remain unchanged.

In this approach, TokenMonster and ungreedy would be spelled using something like Unicode's zero-width joiner or word joiner:

TokenMonster demo

Token+Monster is an un+greedy tokenizer and vocabulary builder !

Or, with Capcode:

^token+^monster is an un+greedy tokenizer and vocabulary builder !

This extension could reduce the number of token ids used by 27% and repurpose them to bring more semantically meaningful words into the vocabulary. With more individual words known, there will be even less need for a concatenation token.

I believe implementing this idea would make the tokenization process more efficient and consistent. Looking forward to hearing your thoughts!

kosiakk commented 1 year ago

Practically speaking, there will be two types of tokens: words and punctuation. We can also call them alphabetic and symbolic.

Symbolic tokens are spelled precisely and preserve all their whitespace: `!`, `•!`, and `•!•` (the same symbol with different surrounding spaces) are all different tokens, if the optimizer decides so. No additional whitespace is added before or after symbols. I'd guess Chinese characters would also fall into this category, because they don't use spaces or capitalization.

Alphabetic tokens are subject to Capcode and the proposed Spacecode. The tokenizer expects one space between pairs of alphabetic tokens, and the token-to-string decoder would insert it by default.

The default case is one space between words. If two words are unconventionally separated by zero, two, three or more spaces, there will be special tokens between them:

| Input | Tokens | Comment |
|---|---|---|
| `helloworld` | `hello` `+` `world` | Unusual concatenation requires additional tokens |
| `hello•world` | `hello` `world` | The most common default case is just 2 tokens |
| `hello••world` | `hello` `•` `world` | Two spaces |
| `hello•••world` | `hello` `••` `world` | Three spaces |
| `hello,•world` | `hello` `,•` `world` | The symbolic token contains all the spaces it needs |

Just like in Capcode, we might foresee modifier tokens:

My guess is that it would improve the compression and semantic clarity of most English texts.
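To make the rule concrete, here is a minimal decoding sketch of the idea. The `+` joiner and `<SP>` extra-space marker are made-up placeholder names for this illustration, not actual TokenMonster tokens:

```python
JOINER = "+"          # hypothetical word-joiner token
EXTRA_SPACE = "<SP>"  # hypothetical "one more space than usual" modifier

def is_alphabetic(tok):
    return tok.strip().isalpha()

def decode(tokens):
    out = []
    prev_alpha = False
    pending_join = False
    extra_spaces = 0
    for tok in tokens:
        if tok == JOINER:
            pending_join = True
            continue
        if tok == EXTRA_SPACE:
            extra_spaces += 1
            continue
        if is_alphabetic(tok):
            if prev_alpha and not pending_join:
                out.append(" ")            # the one space assumed between words
            out.append(" " * extra_spaces)
            out.append(tok)
            prev_alpha = True
        else:
            # Symbolic tokens are spelled precisely and carry their own spaces.
            out.append(" " * extra_spaces)
            out.append(tok)
            prev_alpha = False
        pending_join = False
        extra_spaces = 0
    return "".join(out)

print(decode(["hello", "world"]))           # hello world
print(decode(["hello", "+", "world"]))      # helloworld
print(decode(["hello", "<SP>", "world"]))   # hello  world  (two spaces)
print(decode(["hello", ", ", "world"]))     # hello, world
```

Encoding would be the mirror image: drop the single assumed space between alphabetic tokens and emit a joiner or extra-space marker only where the text deviates from that default.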

alasdairforsythe commented 1 year ago

Hi, thanks for writing that detailed and well thought out suggestion.

I feel a little bad that you wrote that out so nicely and my answer is "it's already done"... in the new version, which I will release this week.

What I did is implement a 'D' deleteToken. In simple terms, any sequence of letters or numbers that is not already preceded by a space has "D " inserted in front. That means all words now consistently begin with a space. The tokenizer can optimize as it likes, but I'm nudging it by giving it the rule that it should side with optimizing towards "space letter" or "space number" beginnings whenever there are 2 or more branches of equal weight.
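In rough Python-like pseudocode, that normalization step might look something like this (a sketch of the idea only, not the actual Go implementation; the marker character below is just a visible stand-in):

```python
import re

D = "Ⓓ"   # visible stand-in for the deleteToken marker; the real byte differs

def normalize(text):
    # Insert "D " before any run of letters/digits that isn't already
    # preceded by a space, so that every word begins with a space.
    return re.sub(r"(?<![A-Za-z0-9 ])(?=[A-Za-z0-9])", D + " ", text)

def denormalize(text):
    # The forward delete removes the single character that follows it.
    return re.sub(re.escape(D) + ".", "", text, flags=re.S)

s = "hello,world (foo)"
n = normalize(s)
print(n)                      # Ⓓ hello,Ⓓ world (Ⓓ foo)
print(denormalize(n) == s)    # True
```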

I've also added 4 modes: 0 = maximize chr/tok, 1 = balanced, 2 = consistent, 3 = maximize consistency (required). Anything greater than 0 will trim tokens that don't follow certain rules. At maximum consistency it'll essentially be forced to have only 1 token for the same word, instead of allowing it to be merged in with punctuation, etc.

I'll have a think about whether a joiner will be useful. It's past midnight and my brain is refusing to think properly, so I'll have a think about it tomorrow.

alasdairforsythe commented 1 year ago

The issue with all methods is what to do with words made of multiple tokens. The current version prefers to tokenize with a space after words. I would have liked to go with that and implement it with a backspace token, but the issue there is that it would mess with the decoding, since a future token could affect a previous token. That's why I had to go with a forward delete.

By assuming a space and adding tokens for non-space, it would mean (almost) doubling the number of tokens required to represent multi-token words. The D deleteToken adds the opportunity for consistency, but the optimizer doesn't have to take it, and it doesn't always (I can see that in my tests), partly because tokens that don't begin with a space can be used as parts of longer words as well as in individual words.

The current version's method of letting it choose anything has the issue that it produces, as you showed, multiple versions of words. That makes it take longer (require more training) for a small language model to learn. My solution is to offer those different "modes" so the user can choose whether they prefer compression or consistency. But it would be nice if there were a one-size-fits-all solution.

alasdairforsythe commented 1 year ago

It seems to be like this:

It can be normalized so there are spaces:

The most optimal will be either before or after, but not both and not neither, because there is only one space between words.

The solution then is to normalize for one direction. This is me thinking out loud. I do wish there were a neat solution that allows there to be one version of a word without requiring additional tokens to either join or space them. The level 3 mode will result in separate word/prefix and suffix tokens, which is not great for compression, but it will be easy for the language model to learn.

kosiakk commented 1 year ago

If you assume a space, you need a token to undo it. If you don't assume it, you need a token to add it.

Exactly! The idea is to assume a space (between pairs of specific tokens) and utilize a dedicated token to remove it when necessary.

normalize for one direction

OpenAI adopted a similar approach: `hello world,world` is encoded as `hello`, `•world`, `,`, `world`, where the two "world" tokens are completely different because of the space. On top of that, `World` and `•World` are two more tokens, resulting in 4 distinct tokens for what humans might consider a single word, just placed differently.

`helloworld` becomes `hell` `ow` `orld`, which will make German much harder to tokenize (German omits spaces in compound words quite often!)
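For reference, this duplication is easy to see with the tiktoken package (a quick sketch; the exact ids don't matter, only that each spelling gets its own):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ["world", " world", "World", " World"]:
    # Each spelling gets its own id(s): spacing and capitalisation
    # multiply the vocabulary entries for what is really one word.
    print(repr(s), enc.encode(s))

# The split of unspaced words depends entirely on the vocabulary:
print(enc.encode("helloworld"))
```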

always before

Yes, OpenAI uses the "space-ish" symbol Ġ, representing whitespace or word boundaries, stored in the vocab.json. This symbol is then cleaned up during decoding.

always before, after, never, or both because there is only one space between words

Only "never" looks reasonable to me - a space is just not part of the word! It's a technical detail, which must be handled by the tokenizer. All programming languages agree with this view - they discard whitespace and comments as soon as possible (Python uses indentation level to replace curly braces, but still ignores spaces around =). Decoding is similar to automatic code formatting - it would insert spaces appropriately and automatically.

Of course, we need exceptions for written natural languages. That's why the newline and two-space tokens exist, and that's why I'm proposing a no-space token as well.

My motivation is not the zip-style compression, but rather enhancing the AI model's efficiency by maintaining the semantic relationship between tokens.

kosiakk commented 1 year ago

Shall we run an experiment?

  1. Tokenize a corpus with a decent vocab.
  2. Patch the vocab by trimming all the spaces from alphabetic tokens (and remove collisions).
  3. Redo the same corpus. The current tokenizer will be forced to insert space tokens between all words, because none of the trimmed tokens would have a space before or after.
  4. Simulate the proposal:
     4.1. How many single-space tokens would be removed (a space between words is assumed by default)?
     4.2. How many word-joiner tokens would be added (for concatenated word pieces)?

This gives us a lower estimate of the possible gains. In practice, the improvement will be better because the patched vocab will be about a quarter smaller.
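Concretely, step 2 of the experiment could be as simple as this (a sketch; `vocab` is assumed to be a plain list of token strings exported from an existing vocabulary):

```python
def patch_vocab(vocab):
    # Trim surrounding spaces from alphabetic tokens and drop the
    # collisions that trimming creates; other tokens stay unchanged.
    seen = set()
    patched = []
    for tok in vocab:
        trimmed = tok.strip()
        key = trimmed if trimmed.isalpha() else tok
        if key not in seen:
            seen.add(key)
            patched.append(key)
    return patched

print(patch_vocab([" greedy", "greedy ", "greedy", "! ", "and "]))
# ['greedy', '! ', 'and']
```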

alasdairforsythe commented 1 year ago

I appreciate your input and I'd be happy for the collaboration.

In terms of compression it won't be better, because the vocabularies are already optimal. If you run the training multiple times it generates the same vocabulary, give or take a few of the borderline tokens.

What makes a difference to the final vocabulary is changing the greediness of the tokenization algorithm, the normalization in some cases, and of course the vocabulary size.

I'm saying that to lead up to an important point here, which is that 100256 is a whole different beast than, let's say, 32000. 100256 is so big that it has spare tokens: the optimizer has plenty of available tokens for making different versions of words merged with punctuation, or with spaces on different sides. That's not true of 32000. At 32000, it's feeling the pressure.

So any analysis you're doing is best done on 32000, which is the ideal one-fits-all model size in my opinion.

To implement your suggestion, it's best done on the tokens that come out of getalltokens before the training; otherwise you're comparing two different vocabulary sizes. However, it might be quite similar to the difference between mode 0 and mode 3 in the new version, so I'd suggest you wait for that first.

Another issue with a merge token is that it adds a lot of complexity to the tokenization. It'd be massively more complicated and run several times slower, because it needs to check not just for the next token but for the possibility of a merge in various positions. The easiest/most-efficient way to implement that would be to add the spaces to the tokens and dataset during training and then swap that over at the end, but that's just the same as using extra space and delete as I'm already doing (only the other way around).

kosiakk commented 1 year ago

Ok, let's wait for your upcoming results with the delete token!

I've checked all available sizes of English:

| Vocab | Alphabetic | % of Vocab | Distinct after trim | Duplicates | % of Distinct | % of Vocab |
|---|---|---|---|---|---|---|
| 100256 | 67569 | 67% | 55579 | 11990 | 22% | 12% |
| 65536 | 45148 | 69% | 36969 | 8179 | 22% | 12% |
| 50256 | 35129 | 70% | 28620 | 6509 | 23% | 13% |
| 32000 | 22896 | 72% | 18509 | 4387 | 24% | 14% |
| 24000 | 17499 | 73% | 14106 | 3393 | 24% | 14% |

Rules:

The fractions remain consistent across vocab sizes. If anything, I'd say the tokenizer can be improved even more for smaller vocab sizes.

the vocabularies are already optimal

Yes, certainly. I'm also not suggesting changing any rules of the optimizer.

changing the tokenization algorithm greediness, and the normalization

Yes, exactly that. It feels like whitespace normalization could improve compression: a quarter of the alphabetic tokens can be replaced with a new rule in the tokenizer, and TokenMonster could nom-nom-nom 12-14% more tokens for other needs. Given the new tokenizer rule, the vocab optimizer will be affected and could do an even better job.

The specific way to do the space normalization seems different in our minds, but I agree - eventually it must lead to the same result. I'm looking forward to it :-) Thank you for the great work!

alasdairforsythe commented 1 year ago

This coming update (it's complete, except for waiting on the vocabularies to be built and for the Python implementation) is a major one. All the vocabularies now prefer to split words with the space in front, and I'm seeing at least 10% gains as well as increased consistency. The "balanced" vocabularies actually perform better on English than the "compressed" vocabularies, which appears to be because it reduces the number of tokens allocated to various combinations of words and symbols with spaces, tabs and newlines, a side effect of a significant part of the English dataset being code. Obviously that also increases the token count for code significantly, but it's much more consistent (and that's optional anyway).

I've also sped up tokenization by 15x in Go and 60x in Python. It now runs about as fast as it can loop through the text and do the branching logic for 3 branches; the actual finding of tokens is optimized away. So this new version will be several times faster than tiktoken.

alasdairforsythe commented 1 year ago

The new version (1.0) is finally released. No prebuilt vocabularies yet though, the first few of those will be up tomorrow.

kosiakk commented 1 year ago

That's brilliant, thank you! I can't wait to see the new vocab :-)