Closed: cfoster0 closed this issue 3 years ago
Using this probably wouldn't be a bad call https://github.com/huggingface/tokenizers
Byte-level tokenization from UTF-8 encoded text seems like the easiest and most flexible option.
No need to over-engineer this: Python's built-in functions let you encode strings into UTF-8 bytes and convert those bytes into integers (token indices), which is sufficient for our purposes.
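A minimal sketch of what that could look like with only the standard library; the `encode`/`decode` names here are just illustrative, not a settled API:

```python
def encode(text: str) -> list[int]:
    """Map a string to a list of integer token indices (each in 0-255)."""
    return list(text.encode("utf-8"))

def decode(tokens: list[int]) -> str:
    """Map byte-level token indices back to the original string."""
    return bytes(tokens).decode("utf-8")

tokens = encode("Hello, world! 👋")
print(tokens)                       # [72, 101, 108, ..., 240, 159, 145, 139]
assert decode(tokens) == "Hello, world! 👋"
```

The nice property is that the vocabulary is fixed at 256 entries and round-trips any Unicode text losslessly.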
Tokenize the text either via chars or unigrams. Figure out the most appropriate method here: ideally we want to be able to accommodate all English text, punctuation, emojis, and potentially other text.
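For comparing the options, here is a rough illustration (not a proposal) of how character-level and byte-level indices differ on text with accents and emojis:

```python
text = "café 👋"

# Character-level: one token per Unicode code point; indices can be as large
# as 0x10FFFF, so the implied vocabulary is enormous and mostly unused.
char_tokens = [ord(c) for c in text]
print(char_tokens)   # [99, 97, 102, 233, 32, 128075]

# Byte-level: one token per UTF-8 byte; indices always fit in 0-255, and any
# text (punctuation, emojis, other scripts) is representable without an OOV token.
byte_tokens = list(text.encode("utf-8"))
print(byte_tokens)   # [99, 97, 102, 195, 169, 32, 240, 159, 145, 139]
```

The trade-off is sequence length: byte-level sequences are longer (emojis take 4 tokens each), but the vocabulary stays tiny and nothing is ever out of vocabulary.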