Closed Mathmagician8191 closed 1 year ago
Everything should be fixed now, except that there are still no tests.
There is now a test verifying both encoding and decoding of the main test string from the original implementation
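For readers following along, a round-trip test of this kind can be sketched roughly as below. This is only an illustration: `ByteTokenizer` is a hypothetical stand-in, not the world tokenizer from this PR, and `check_round_trip` is a made-up helper name. The real test uses the main test string from the original implementation instead of the short strings shown here.

```python
class ByteTokenizer:
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))

    def decode(self, tokens):
        return bytes(tokens).decode("utf-8")


def check_round_trip(tokenizer, text, expected_tokens=None):
    """Encode text, optionally compare against known tokens, then decode back."""
    tokens = tokenizer.encode(text)
    if expected_tokens is not None:
        # Verifies encoding against reference output, if one is available.
        assert tokens == expected_tokens, "encoding differs from reference"
    # Verifies decoding: the original text must be recovered exactly.
    assert tokenizer.decode(tokens) == text, "decode(encode(text)) != text"
    return tokens


if __name__ == "__main__":
    check_round_trip(ByteTokenizer(), "ab", expected_tokens=[97, 98])
    check_round_trip(ByteTokenizer(), "héllo 世界")
    print("round trip OK")
```

Checking both directions matters because a decoder can mask encoder bugs (and vice versa) if only the round trip is compared.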
Let me know what you think about this, and I'll merge.
The long test string should be fine: the test isn't likely to be run that often, and it makes sure all edge cases are handled.
Tokenizer implementation taken from https://github.com/BlinkDL/ChatRWKV/tree/main/tokenizer with test code removed.
This pull request adds a tokenizer command line argument to chat_with_bot.py, generate_completions.py and measure_pexplexity.py. Current options are the original 20B tokenizer (default) and the new world tokenizer.
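A minimal sketch of how such an argument could be wired up with `argparse` follows. The flag spelling `--tokenizer`, the choice strings `20B` and `world`, and the `model_path` positional are assumptions made for illustration; the actual scripts in this PR may name or order them differently.

```python
import argparse


def parse_args(argv=None):
    """Parse arguments for a hypothetical script with a tokenizer choice."""
    parser = argparse.ArgumentParser(
        description="Illustrative argument parsing with a tokenizer option"
    )
    # Assumed positional argument; not taken from the PR.
    parser.add_argument("model_path", help="Path to the model file")
    # Assumed flag name and choice strings; the 20B tokenizer is the default.
    parser.add_argument(
        "--tokenizer",
        choices=["20B", "world"],
        default="20B",
        help="Tokenizer to use (default: 20B)",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"model={args.model_path} tokenizer={args.tokenizer}")
```

Restricting the value with `choices` means an unsupported tokenizer name is rejected at startup rather than failing later during model loading.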