Tokenizer ggcc - Githubissues

This push addresses a serious difference in the tokenization, the one we used was the sentencepiece llama tokenizer. It is not able to tokenizer for Falcon, almost any long word was broken up (including special words) and international content was almost always complete garbage afterward. This is a serious issue to model quality, so it had to go.

This is a large change, I've spent 20+ hours first reversing what BPE in GPT-2 regex mode really does, then reimplementing it in C++ without the use of regular expressions and unicode libraries. Could have used regexp but it felt wrong to add such a chunky slowness into it. The whole task felt like reinventing the wheel while getting punched constantly. So I made a slim unicode library to work with and identify unicode data, bpe-decoders and -encoders similar to the special GPT2-BPE ones, a lookahead parser that should perform identical to the GPT/Falcon regular expression: 's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+.

In addition I created a new magic and file version for Falcon ggllm.cpp called "GGCC" V0, there is no way around that now as the falcon tokenizer requires 65,000 merging sub-token pairs to operate. The model loader supports two modes: 1) old compatibility mode (also for quantizing/converting to GGCC) - you need to have the original tokenizer.json file in the same directory as the model binary. The original falcon-7b tokenizer.json should work for all finetunes because they do not change bpe merging. I did not integrate it into the python code, it's still creating llama ggml V1 to be converted. The json parser I wrote is absolute minimal, it expects the json pretty-print format from HF models. 2) The new model binary has the required data integrated, so no more additional files from there on.

Updating By using the updated falcon quantizer you will convert the old model (best to start from 32 bit, 16 bit should be fine too) into a GGCC model (file version 10). You will need the falcon 40b tokenizer.json in the model directory. https://huggingface.co/tiiuae/falcon-7b/blob/main/tokenizer.json

The new solution will need more tests, I am merging it anyway so this moves ahead. I was not able to find any differences between this one and the transformers tokenizer, though given the scale of this addition there can be issues remaining, in worst case crashes. I ran the tokenizer on a couple random files and that worked fine.

Test input 1:

"A Nuskhuri abbreviation of იესუ ქრისტე ( iesu kriste ) ' Jesus Christ '"
Result before:
    44 -> 'A'
 19836 -> ' Nu'
  4282 -> 'sk'
 11481 -> 'hu'
   365 -> 'ri'
 44323 -> ' abbrevi'
   326 -> 'ation'
   275 -> ' of'
   204 -> ' '
   228 -> '�'
   134 -> '�'
   155 -> '�'
   228 -> '�'
   134 -> '�'
   151 -> '�'
   228 -> '�'
   134 -> '�'
   164 -> '�'
   228 -> '�'
   134 -> '�'
   166 -> '�'
   204 -> ' '
   228 -> '�'
   134 -> '�'
   168 -> '�'
   228 -> '�'
   134 -> '�'
   163 -> '�'
   228 -> '�'
   134 -> '�'
   155 -> '�'
   228 -> '�'
   134 -> '�'
   164 -> '�'
   228 -> '�'
   134 -> '�'
   165 -> '�'
   228 -> '�'
   134 -> '�'
   151 -> '�'
   204 -> ' '
    19 -> '('
 30139 -> ' ie'
  5243 -> 'su'
 17357 -> ' kr'
  9024 -> 'iste'
   204 -> ' '
    20 -> ')'
   204 -> ' '
    18 -> '''
 22464 -> 'Jesus'
  1781 -> ' Christ'
    18 -> '''

Result now (identical to transformers tokenization):

    44 -> 'A'
   409 -> ' N'
   359 -> 'us'
 19092 -> 'kh'
  8596 -> 'uri'
 32899 -> ' abbre'
 48969 -> 'viation'
   275 -> ' of'
 27171 -> ' �'
   209 -> '�'
   230 -> '�'
 36891 -> '�'
   226 -> '�'
 36891 -> '�'
   106 -> '�'
 36891 -> '�'
   108 -> '�'
 27171 -> ' �'
   209 -> '�'
   110 -> '�'
 36891 -> '�'
   238 -> '�'
 36891 -> '�'
   230 -> '�'
 36891 -> '�'
   106 -> '�'
 36891 -> '�'
   107 -> '�'
 36891 -> '�'
   226 -> '�'
   204 -> ' '
    19 -> '('
   204 -> ' '
   424 -> 'ies'
    96 -> 'u'
   429 -> ' k'
  1431 -> 'rist'
    80 -> 'e'
   204 -> ' '
    20 -> ')'
   204 -> ' '
    18 -> '''
  4159 -> ' Jesus'
  1781 -> ' Christ'
   204 -> ' '
    18 -> '''

Handling of fine tuned custom special tokens: Before:

    39 -> '<'
   103 -> '|'
 18269 -> 'prom'
 15608 -> 'pte'
    93 -> 'r'
 54146 -> '|>'
  7282 -> 'Who'
   304 -> ' is'
 25286 -> ' smarter'
    23 -> ','
 23906 -> ' Newton'
   379 -> ' or'
 33264 -> ' Einstein'
 46938 -> '?<'
   103 -> '|'
  4853 -> 'endo'
 23346 -> 'fte'
   689 -> 'xt'
 54146 -> '|>'
    39 -> '<'
   103 -> '|'
 58791 -> 'assist'
   384 -> 'ant'
 54146 -> '|>'

New tokenizer:

main: number of tokens in prompt = 11
 65028 -> '<|prompter|>'
  7282 -> 'Who'
   304 -> ' is'
 25286 -> ' smarter'
    23 -> ','
 23906 -> ' Newton'
   379 -> ' or'
 33264 -> ' Einstein'
    42 -> '?'
    11 -> '<|endoftext|>'
 65025 -> '<|assistant|>'

cmp-nct / ggllm.cpp

Tokenizer ggcc #38