alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
MIT License

Meaning of C and D #15

Open Maxscha opened 11 months ago

Maxscha commented 11 months ago

Thanks for this amazing library. I'm looking forward to actually training and adapting some models with it.

After creating my first vocabulary, I noticed that a lot of the tokens contain an uppercase C or an uppercase D. Do those have a special meaning? I can see them referenced in the code, but I could not find an explanation of what they mean.

Thanks in advance

Example:

tokens:
    - token:   "D"
      id:      35
      score:   0.006828829
      encoded: true
    - token:   " und"
      id:      2657
      score:   0.0047021606
      encoded: true
    - token:   " der"
      id:      2099
      score:   0.0032128973
      encoded: true
    - token:   "C"
      id:      34
      score:   0.0031624683
      encoded: true
    - token:   " die"
      id:      2105
      score:   0.002436903
      encoded: true
    - token:   " von"
      id:      2684
      score:   0.0021727835
      encoded: true
    - token:   ".C"
      id:      271
      score:   0.0020115946
      encoded: true
    - token:   " für"
      id:      5997
      score:   0.0017581019
      encoded: true
    - token:   "-DC"
      id:      1163
      score:   0.0017092729
      encoded: true
    - token:   " des"
      id:      2100
      score:   0.0016576286
      encoded: true
    - token:   " mit"
      id:      2407
      score:   0.0014818916
      encoded: true
    - token:   " in"
      id:      993
      score:   0.0014810717
      encoded: true
    - token:   ",C"
      id:      259
      score:   0.0014182056
      encoded: true
    - token:   ","

alasdairforsythe commented 11 months ago

D, C & W are 'capcode' markers used at capcode level 2. At capcode level 1, only ord(127) is used instead. D means delete the next space. C means uppercase the next character. W means uppercase the next word.
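To make the three markers concrete, here is a rough Python sketch of how a decoder might interpret them. This is not tokenmonster's actual implementation: the function name `decode_capcode_sketch` is hypothetical, and it rests on several assumptions, namely that the encoded text is otherwise lowercase (so every literal `D`, `C` or `W` is a marker), that `C` applies to the next letter rather than the next byte, that `W` applies to the next unbroken run of letters, and that `D` only removes a space that follows it directly (other markers may sit in between).

```python
def decode_capcode_sketch(encoded: str) -> str:
    """Illustrative decoder for capcode level 2 style markers (assumed semantics)."""
    out = []
    delete_next_space = False   # set by D
    upper_next_char = False     # set by C
    upper_word_pending = False  # set by W, waiting for the word to start
    upper_word_active = False   # currently inside the word being uppercased

    for ch in encoded:
        # Markers are consumed and never emitted.
        if ch == "D":
            delete_next_space = True
            continue
        if ch == "C":
            upper_next_char = True
            continue
        if ch == "W":
            upper_word_pending = True
            continue

        # D: drop the space that follows the marker(s).
        if ch == " " and delete_next_space:
            delete_next_space = False
            continue
        delete_next_space = False

        if ch.isalpha():
            if upper_next_char:
                ch = ch.upper()
                upper_next_char = False
            elif upper_word_pending or upper_word_active:
                ch = ch.upper()
                upper_word_pending = False
                upper_word_active = True
        elif upper_word_active:
            # A non-letter ends the uppercased word.
            upper_word_active = False

        out.append(ch)

    return "".join(out)


print(decode_capcode_sketch("DC hello-DC world.W nasa"))  # Hello-World. NASA
```

Under these assumptions, a token such as "-DC" from the vocabulary above reads as "a hyphen, then delete the next space, then uppercase the next character", so "-DC" followed by " world" decodes to "-World".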