miaomiao1992 opened this issue 1 month ago
The issue is that I have only implemented a simple tokenizer, which supports the Latin alphabet and filters out every other character. The resulting text corpus becomes `"\nJack\n\n\n32\n\n\n\n"`,
which is shorter than the context length, leading to the underflow.
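For reference, a minimal sketch of that filtering behaviour (the function name and the exact whitespace handling are illustrative, not the actual implementation):

```python
import re

# Hypothetical sketch of the current filtering (not the real implementation):
# keep Latin letters, digits, and newlines; drop everything else.
def latin_only_filter(text: str) -> str:
    return re.sub(r"[^A-Za-z0-9\n]", "", text)

corpus = "你是谁? 我是Jack。\n你今年几岁? 我今年32岁。\n你女儿是谁? 我女儿是小圆圆。"
print(repr(latin_only_filter(corpus)))
# -> 'Jack\n32\n' -- only "Jack" and "32" survive, far shorter than the context length
```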
The solution would be a more sophisticated tokenizer. Just allowing all Unicode characters would probably blow up the vocab size too much, so maybe I will implement a BPE-based tokenizer like the one used by GPT-2 in the future, which also supports arbitrary Unicode characters.
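As a rough illustration of the byte-level idea behind GPT-2's tokenizer (the helper name is assumed, not this project's API): operating on UTF-8 bytes keeps the base vocabulary fixed at 256 tokens while still covering arbitrary Unicode, and BPE merges are then learned on top of the byte sequences.

```python
# Illustrative sketch of the byte-level idea (assumed helper name, not the
# project's API): encode to UTF-8 bytes, so every Unicode string maps onto
# a fixed base vocabulary of 256 byte values. A BPE tokenizer like GPT-2's
# then learns merge rules on top of these byte sequences.
def byte_tokenize(text: str) -> list[int]:
    return list(text.encode("utf-8"))  # each token id is a byte in [0, 255]

print(byte_tokenize("我是Jack"))
# -> [230, 136, 145, 230, 152, 175, 74, 97, 99, 107]
```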
corpus.txt contains the following:

```
你是谁? 我是Jack。
你今年几岁? 我今年32岁。
你女儿是谁? 我女儿是小圆圆。
```

(In English: "Who are you? I am Jack." / "How old are you this year? I am 32 this year." / "Who is your daughter? My daughter is Xiao Yuanyuan.")