Correct handling of unsupported characters (European and Asian languages should work now)
Added input conversion from UTF-16 to UTF-8 for Windows
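The conversion logic can be sketched as follows. This is a minimal, hand-rolled illustration of UTF-16 to UTF-8 transcoding (including surrogate pairs, as produced by the Windows wide-character console APIs), not the actual code in the repository; the function name `utf16_to_utf8` is hypothetical.

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch: transcode a UTF-16 string (e.g. read via Windows
// wide-character console input) into a UTF-8 byte string.
static std::string utf16_to_utf8(const std::u16string &in) {
    std::string out;
    for (size_t i = 0; i < in.size(); ++i) {
        uint32_t cp = in[i];
        // combine a high/low surrogate pair into one code point
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            uint32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        // emit 1-4 UTF-8 bytes depending on the code point range
        if (cp < 0x80) {
            out += char(cp);
        } else if (cp < 0x800) {
            out += char(0xC0 | (cp >> 6));
            out += char(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += char(0xE0 | (cp >> 12));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        } else {
            out += char(0xF0 | (cp >> 18));
            out += char(0x80 | ((cp >> 12) & 0x3F));
            out += char(0x80 | ((cp >> 6) & 0x3F));
            out += char(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

On Windows the input would typically be obtained with `ReadConsoleW` so that non-ASCII characters survive the console round-trip before being handed to the tokenizer as UTF-8.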
Added a workaround for the tokenizer: merge token triplets when the pair cannot be merged into a token present in the vocabulary
This is not a complete solution, but it covers many cases of wrong tokenization caused by Falcon's larger vocabulary
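The workaround can be illustrated with a toy greedy tokenizer. This is a simplified sketch, not the actual ggllm.cpp tokenizer: the vocabulary is a plain string map, symbols start as single bytes, and the `tokenize` helper is hypothetical. The point is the fallback step: when no adjacent pair forms a vocabulary token, three adjacent symbols are tried as one unit, reaching tokens that pairwise merges alone cannot form.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Toy illustration of the triplet-merge workaround (not the real tokenizer):
// greedily merge adjacent symbols; if no pair is in the vocabulary,
// fall back to merging three neighbours at once.
static std::vector<std::string> tokenize(
        const std::string &text,
        const std::unordered_map<std::string, int> &vocab) {
    // start from single bytes for simplicity
    std::vector<std::string> syms;
    for (char c : text) syms.push_back(std::string(1, c));

    bool merged = true;
    while (merged) {
        merged = false;
        // usual pairwise merge pass
        for (size_t i = 0; i + 1 < syms.size(); ++i) {
            std::string pair = syms[i] + syms[i + 1];
            if (vocab.count(pair)) {
                syms[i] = pair;
                syms.erase(syms.begin() + i + 1);
                merged = true;
                break;
            }
        }
        if (merged) continue;
        // workaround: try triplets that no pair merge can reach
        for (size_t i = 0; i + 2 < syms.size(); ++i) {
            std::string trip = syms[i] + syms[i + 1] + syms[i + 2];
            if (vocab.count(trip)) {
                syms[i] = trip;
                syms.erase(syms.begin() + i + 1, syms.begin() + i + 3);
                merged = true;
                break;
            }
        }
    }
    return syms;
}
```

For example, with a vocabulary containing `"abc"` but neither `"ab"` nor `"bc"`, pairwise merging alone would leave the three single-character tokens, while the triplet fallback produces the single token `"abc"`.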