Correct handling of unsupported characters (European and Asian languages should work now)
Added input conversion from UTF-16 to UTF-8 for Windows
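The conversion logic can be sketched as follows. This is a minimal, hand-rolled illustration of UTF-16 to UTF-8 transcoding (including surrogate pairs, as produced by the Windows wide-character console APIs), not the actual code in the repository; the function name `utf16_to_utf8` is hypothetical.

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch: transcode a UTF-16 string (e.g. read via Windows
// wide-character console input) into a UTF-8 byte string.
static std::string utf16_to_utf8(const std::u16string &in) {
    std::string out;
    for (size_t i = 0; i < in.size(); ++i) {
        uint32_t cp = in[i];
        // combine a high/low surrogate pair into one code point
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            uint32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        // emit 1-4 UTF-8 bytes depending on the code point range
        if (cp < 0x80) {
            out += char(cp);
        } else if (cp < 0x800) {
            out += char(0xC0 | (cp >> 6));
            out += char(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += char(0xE0 | (cp >> 12));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        } else {
            out += char(0xF0 | (cp >> 18));
            out += char(0x80 | ((cp >> 12) & 0x3F));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

On Windows the input would typically be obtained with `ReadConsoleW` so that non-ASCII characters survive the console round-trip before being handed to the tokenizer as UTF-8.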
Added a workaround for the tokenizer: merge token triplets when the pair cannot be merged into a token present in the vocabulary
This is not a complete solution, but it covers many cases of wrong tokenization caused by Falcon's larger vocabulary
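The workaround can be illustrated with a toy greedy tokenizer. This is a simplified sketch, not the actual ggllm.cpp tokenizer: the vocabulary is a plain string map, symbols start as single bytes, and the `tokenize` helper is hypothetical. The point is the fallback step: when no adjacent pair forms a vocabulary token, three adjacent symbols are tried as one unit, reaching tokens that pairwise merges alone cannot form.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Toy illustration of the triplet-merge workaround (not the real tokenizer):
// greedily merge adjacent symbols; if no pair is in the vocabulary,
// fall back to merging three neighbours at once.
static std::vector<std::string> tokenize(
        const std::string &text,
        const std::unordered_map<std::string, int> &vocab) {
    // start from single bytes for simplicity
    std::vector<std::string> syms;
    for (char c : text) syms.push_back(std::string(1, c));

    bool merged = true;
    while (merged) {
        merged = false;
        // usual pairwise merge pass
        for (size_t i = 0; i + 1 < syms.size(); ++i) {
            std::string pair = syms[i] + syms[i + 1];
            if (vocab.count(pair)) {
                syms[i] = pair;
                syms.erase(syms.begin() + i + 1);
                merged = true;
                break;
            }
        }
        if (merged) continue;
        // workaround: try triplets that no pair merge can reach
        for (size_t i = 0; i + 2 < syms.size(); ++i) {
            std::string trip = syms[i] + syms[i + 1] + syms[i + 2];
            if (vocab.count(trip)) {
                syms[i] = trip;
                syms.erase(syms.begin() + i + 1, syms.begin() + i + 3);
                merged = true;
                break;
            }
        }
    }
    return syms;
}
```

For example, with a vocabulary containing `"abc"` but neither `"ab"` nor `"bc"`, pairwise merging alone would leave the three single-character tokens, while the triplet fallback produces the single token `"abc"`.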