Closed: cfoster0 closed this issue 3 years ago
Using this probably wouldn't be a bad call https://github.com/huggingface/tokenizers
Byte-level tokenization from UTF-8 encoded text seems like the easiest and most flexible option.
No need to over-engineer this: Python's built-in functions let you encode strings into UTF-8 bytes and convert those bytes into integers (token indices), which is sufficient for our purposes.
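A minimal sketch of what that could look like with only the standard library; the `encode`/`decode` names here are just illustrative, not a settled API:

```python
def encode(text: str) -> list[int]:
    """Map a string to a list of integer token indices (each in 0-255)."""
    return list(text.encode("utf-8"))

def decode(tokens: list[int]) -> str:
    """Map byte-level token indices back to the original string."""
    return bytes(tokens).decode("utf-8")

tokens = encode("Hello, world! 👋")
print(tokens)                       # [72, 101, 108, ..., 240, 159, 145, 139]
assert decode(tokens) == "Hello, world! 👋"
```

The nice property is that the vocabulary is fixed at 256 entries and round-trips any Unicode text losslessly.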
Tokenize the text either via chars or unigrams. Figure out the most appropriate method here: ideally we want to be able to accommodate all English text, punctuation, emojis, and potentially other text.
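For comparing the options, here is a rough illustration (not a proposal) of how character-level and byte-level indices differ on text with accents and emojis:

```python
text = "café 👋"

# Character-level: one token per Unicode code point; indices can be as large
# as 0x10FFFF, so the implied vocabulary is enormous and mostly unused.
char_tokens = [ord(c) for c in text]
print(char_tokens)   # [99, 97, 102, 233, 32, 128075]

# Byte-level: one token per UTF-8 byte; indices always fit in 0-255, and any
# text (punctuation, emojis, other scripts) is representable without an OOV token.
byte_tokens = list(text.encode("utf-8"))
print(byte_tokens)   # [99, 97, 102, 195, 169, 32, 240, 159, 145, 139]
```

The trade-off is sequence length: byte-level sequences are longer (emojis take 4 tokens each), but the vocabulary stays tiny and nothing is ever out of vocabulary.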