Thanks for this amazing library. Looking forward to actually training and adapting some models for it.
After creating my first vocabulary I noticed that a lot of the tokens contain uppercase C and uppercase D. Do those have a special meaning? I could also see them referenced in the code, but I could not find the meaning.
D, C and W are 'capcode' markers used at capcode level 2. At capcode level 1, only the single character with code 127 (ASCII DEL) is used as a marker instead.
D means delete the next space.
C means uppercase the next character.
W means uppercase the next word.
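To make the three markers concrete, here is a minimal Python sketch of how they could be applied when turning encoded text back into normal text. It is a toy illustration and not the library's actual decoder: the function name `decode_capcode_sketch` is made up, and it assumes each marker sits directly in front of the text it affects and that any uppercase D, C or W in the input is a marker.

```python
def decode_capcode_sketch(encoded: str) -> str:
    """Toy decoder for the D/C/W markers described above (hypothetical, not the library's code)."""
    out = []
    delete_space = False      # set by D: drop the next character if it is a space
    upper_next_char = False   # set by C: uppercase the next character
    upper_word = False        # set by W: uppercase letters until the word ends

    for ch in encoded:
        # Assumption: uppercase D, C and W only ever appear as markers.
        if ch == "D":
            delete_space = True
            continue
        if ch == "C":
            upper_next_char = True
            continue
        if ch == "W":
            upper_word = True
            continue

        if delete_space:
            delete_space = False
            if ch == " ":
                continue  # D: the space is removed

        if upper_next_char:
            upper_next_char = False
            out.append(ch.upper())  # C: only this one character is uppercased
            continue

        if upper_word:
            if ch.isalpha():
                out.append(ch.upper())  # W: still inside the word
                continue
            upper_word = False  # the word ended at the first non-letter

        out.append(ch)

    return "".join(out)


print(decode_capcode_sketch("Chello Wworld!"))  # Hello WORLD!
print(decode_capcode_sketch("helloD world"))    # helloworld
```

Because every capital letter is expressed through a marker plus a lowercase letter, the vocabulary only needs the lowercase form of each word, which is why the markers show up in so many tokens.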
Thanks in advance
Example: