THUDM / icetk

A unified tokenization tool for Images, Chinese and English.

How was the tokenizer learned? #4

Closed · silverriver closed this issue 1 year ago

silverriver commented 1 year ago

Could you provide more details about how these tokenizers were trained?

For example:

  1. On what images/texts were the VQ-VAE/sentencepiece tokenizers trained?
  2. For the text tokenizer, how did you build the two vocabularies? Did you train the Chinese and English vocabularies separately and then merge them? Does the sentencepiece library provide such a feature?
  3. Which mode did you use to build the sentencepiece tokenizer: unigram, BPE, or wordpiece?
  4. Have you considered using byte-level BPE (as in GPT-2), since it can handle almost all Unicode text?
Sleepychord commented 1 year ago
  1. The image embeddings are learned on the CogView dataset, and the text tokenizer is learned on a subset of the Pile and WuDaoCorpus.
  2. We train the two dictionaries separately and merge them with some processing to prevent repetition; joint training led to an imbalanced vocabulary in our experiments. We use Google's sentencepiece.
  3. Unigram (see the training sketch below).
  4. Unfortunately, no.
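For reference, training a unigram model with the sentencepiece library looks roughly like the sketch below. The corpus path, model prefix, vocabulary size, and coverage value are placeholder assumptions, not icetk's actual settings.

```python
import sentencepiece as spm

# Train a unigram-LM tokenizer; all values here are illustrative.
spm.SentencePieceTrainer.train(
    input='corpus_en.txt',      # one sentence per line (hypothetical file)
    model_prefix='en_unigram',  # writes en_unigram.model / en_unigram.vocab
    vocab_size=50000,           # placeholder, not icetk's real size
    model_type='unigram',       # the mode confirmed above
    character_coverage=0.9995,
)
```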
silverriver commented 1 year ago

Thank you for your quick response. I have some follow-up questions about the process used to merge the Chinese and English vocabularies.

  1. How did you merge these vocabularies after removing the duplicates? Does the sentencepiece library provide such an API, or did you directly modify the internal state of the spm model?
  2. If you did the merging yourself, how did you estimate the probability associated with each token? A unigram model requires a probability assigned to every token in the vocabulary.

Thank you very much in advance.

Sleepychord commented 1 year ago
  1. No, sentencepiece doesn't provide such an API; I manually modified the internal state of the model (sketched below).
  2. We just make the numbers of Chinese and English tokens roughly equal, which is what we want.
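For concreteness, a merge along these lines can be sketched by editing sentencepiece's model protobuf directly. This is a minimal illustration of the manual approach described above, not the exact processing used for icetk; the file names, the duplicate filter, and the score handling are assumptions.

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load the serialized protos of the two separately trained models
# (file names are hypothetical).
en = sp_pb2.ModelProto()
with open('en_unigram.model', 'rb') as f:
    en.ParseFromString(f.read())
zh = sp_pb2.ModelProto()
with open('zh_unigram.model', 'rb') as f:
    zh.ParseFromString(f.read())

# Append Chinese pieces to the English model, skipping pieces that
# already exist (the "processing to prevent repetition").
seen = {p.piece for p in en.pieces}
for p in zh.pieces:
    if p.piece in seen:
        continue
    new_piece = sp_pb2.ModelProto.SentencePiece()
    new_piece.piece = p.piece
    new_piece.score = p.score  # unigram score = log-probability
    en.pieces.append(new_piece)

with open('merged.model', 'wb') as f:
    f.write(en.SerializeToString())

# The merged file loads like any other sentencepiece model.
sp = spm.SentencePieceProcessor(model_file='merged.model')
```

Note that scores from two independently trained unigram models are log-probabilities under different distributions, so they are not calibrated against each other; keeping the Chinese and English token counts roughly balanced, as described above, is one pragmatic way to avoid renormalizing them.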
silverriver commented 1 year ago

I am wondering whether it would be possible for you to share the script used to merge these vocabularies.

Specifically, how did you merge the tries of the English and Chinese spm models?

Sleepychord commented 1 year ago

I am sorry, they were mostly on-the-fly IPython snippets and I didn't save them.

silverriver commented 1 year ago

Thank you for your reply.