HomebrewNLP / Olmax

HomebrewNLP in JAX flavour for maintainable TPU-Training

BSD 2-Clause "Simplified" License

45 stars, 6 forks

Tokenizing Phonetics #10

Open ClashLuke opened 2 years ago

ClashLuke commented 2 years ago

Currently, all tokenisers work on a character level, so a tokeniser built for one language often cannot be transferred to another. A model trained with such a tokeniser is therefore specific to that language and won't be able to transfer from, say, Spanish to Italian without significant effort. Additionally, written language is a quantised form of speech that reduces the space needed to store it. This conversion is very lossy, as text doesn't carry sarcasm or other vocal information.

We hope to reduce the first issue by using phonetic information while leaving the second untouched. The second could be solved by #9, although that uses less sparsity and therefore needs a bigger context to encode the same information.

This issue tracks the progress of implementing such a tokeniser built on phonetic information and the resulting language model trained with it.

buttercutter commented 2 years ago

@ClashLuke How about using byte-level BPE (BBPE)?
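For readers unfamiliar with the idea, here is a minimal sketch of byte-level BPE: start from the raw UTF-8 bytes of the text and repeatedly merge the most frequent adjacent pair. The function name `train_byte_bpe` and its parameters are made up for illustration; this is a toy, not the Facebook BBPE implementation and not code from this repository.

```python
from collections import Counter


def train_byte_bpe(text: str, num_merges: int):
    """Toy byte-level BPE (hypothetical helper, for illustration only).

    Returns the learned merges and the final token sequence.
    """
    # Every token starts out as a single UTF-8 byte.
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so merging gains nothing
        merges.append((left, right))
        # Replace each occurrence of the pair, scanning left to right.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


if __name__ == "__main__":
    merges, tokens = train_byte_bpe("日本語日本語", num_merges=10)
    print(merges)
    print(tokens)
```

Because the base vocabulary is the 256 possible byte values, any language can be encoded without unknown tokens, which is what makes it attractive to combine with a phonetic representation.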

ClashLuke commented 2 years ago

That would be a perfect combination! Do you want to give it a try?

buttercutter commented 2 years ago

I am still checking with the original Facebook authors on some of the technical details about BBPE.

Once I understand the BBPE mechanism, I can help implement this for your project, but I suppose you are already well-versed in the BBPE tokenizer?

buttercutter commented 2 years ago

I managed to understand the rationale behind their dynamic programming in equation (1) for BBPE.

However, I am still checking why the BBPE output would be the same between 4K and 32K.
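For context, a sketch of a dynamic program of the kind referred to above, assuming it computes the maximum number of valid UTF-8 characters that can be recovered from a byte sequence, i.e. f(k) = max over t in 1..4 of f(k - t) + g(k - t + 1, k), where g is 1 if those t bytes form one valid character and 0 otherwise. The function names are hypothetical and the recurrence is my reading of the paper, not a verified copy of it.

```python
def is_valid_char(chunk: bytes) -> bool:
    """True if `chunk` decodes to exactly one UTF-8 character."""
    try:
        return len(chunk.decode("utf-8")) == 1
    except UnicodeDecodeError:
        return False


def max_recoverable_chars(data: bytes) -> int:
    """f[k] = most valid characters recoverable from the first k bytes."""
    n = len(data)
    f = [0] * (n + 1)
    for k in range(1, n + 1):
        # A UTF-8 character spans 1 to 4 bytes, hence t in 1..4.
        for t in range(1, min(4, k) + 1):
            g = 1 if is_valid_char(data[k - t:k]) else 0
            f[k] = max(f[k], f[k - t] + g)
    return f[n]


if __name__ == "__main__":
    clean = "日本語".encode("utf-8")
    noisy = b"\xe6" + "日本".encode("utf-8")  # one stray lead byte in front
    print(max_recoverable_chars(clean))
    print(max_recoverable_chars(noisy))
```

The stray byte in `noisy` is simply skipped, so a decoder built on this recurrence can still recover the characters around an invalid byte.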

ClashLuke commented 2 years ago

It's a bit hard to see, but the outputs do differ. Look at the bytes of the Japanese tokens and the spaces between them.

buttercutter commented 2 years ago

Yup, it is hard to see the extra whitespace. However, I am confused about how an extra whitespace would be introduced by the dynamic programming in equation (1). Any idea?

Besides, the equation might need to be adapted to limit t to a maximum of 3 instead of 4, given that Chinese characters are encoded with just 3 bytes, as far as I found. Please correct me if I am wrong.

ClashLuke commented 2 years ago

It's not about Chinese characters; it's about UTF-8, in which a single character can take up to 4 bytes. Please look at its Wikipedia entry, as it explains the encoding quite well.
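A quick way to check this is to encode a few characters and count the bytes. The sample characters below are chosen for illustration; note that the rarer CJK ideograph U+20000 (Extension B) already needs 4 bytes, so a limit of 3 would not even cover all Chinese characters.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


samples = [
    ("A", "Latin letter"),
    ("\u00e9", "Latin letter with accent"),
    ("\u4e2d", "common CJK ideograph"),
    ("\U00020000", "CJK Extension B ideograph"),
    ("\U0001F600", "emoji"),
]

if __name__ == "__main__":
    for ch, description in samples:
        print(f"U+{ord(ch):04X} {description}: {utf8_len(ch)} byte(s)")
```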