OpenNMT / Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support
https://opennmt.net/
MIT License

vocabulary_path not working properly with support_prior_joiners? #177

Closed: Zenglinxiao closed this issue 3 years ago

Zenglinxiao commented 3 years ago

Hello Guillaume, I ran into an issue when using vocabulary_path. Normally, with vocabulary_path, we would expect the tokenized output not to contain, as separate tokens, vocabulary entries whose frequency is below the given threshold. But once this feature is used on pre-segmented text together with support_prior_joiners, this is not guaranteed. Example:

>>> tokenizer = pyonmttok.Tokenizer("space", joiner_annotate=True, support_prior_joiners=True, bpe_model_path="bpe.code")
>>> tokenizer.tokenize_file("corpora", "corpora.tok")
>>> get_vocab("corpora.tok", "vocab_of_tokenized_file")
>>> tokenizer_threshold = pyonmttok.Tokenizer("space", joiner_annotate=True, support_prior_joiners=True, bpe_model_path="bpe.code", vocabulary_path="vocab_of_tokenized_file", vocabulary_threshold=13)
>>> tokenizer.tokenize("目前■ ,■ 美国■ 进出口■ 分别■ 占■ GDP■ 的■ 15%■ 和■ 12 % 。")[0]
['目前■', ',■', '美国■', '进■', '出口■', '分别■', '占■', 'GDP■', '的■', '15■', '%■', '和■', '1■', '2', '%', '。']
>>> tokenizer_threshold.tokenize("目前■ ,■ 美国■ 进出口■ 分别■ 占■ GDP■ 的■ 15%■ 和■ 12 % 。")[0]
['目■', '前■', ',■', '美■', '国■', '进■', '出■', '口■', '分■', '别■', '占■', 'G■', 'D■', 'P■', '的■', '15■', '%■', '和■', '1■', '2', '%', '。']

Note that only 出口■ has a frequency below 13 and should therefore be unmerged. Furthermore, when the corpus is re-tokenized with a tokenizer using threshold 2, the vocabulary size decreases from 3700 to 2221, while there are 3041 tokens with frequency > 2 in the original vocabulary. Any idea?
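
For reference, a quick way to check which of the tokens above actually fall below the threshold is to read the vocabulary file directly. This is a minimal sketch, not part of the original report; it assumes the "<token> <frequency>" format produced by get_vocab and the file name used in the repro:

# Sketch: load the "<token> <frequency>" vocabulary and flag which of the
# example's tokens have a frequency below the threshold of 13.
counts = {}
with open("vocab_of_tokenized_file", encoding="utf-8") as f:
    for line in f:
        token, freq = line.rstrip("\n").rsplit(" ", 1)
        counts[token] = int(freq)

tokens = ['目前■', ',■', '美国■', '进■', '出口■', '分别■', '占■', 'GDP■',
          '的■', '15■', '%■', '和■', '1■', '2', '%', '。']
for token in tokens:
    freq = counts.get(token, 0)
    print(token, freq, "below threshold" if freq < 13 else "")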

guillaumekln commented 3 years ago

Hi,

Just to make sure I understood correctly: do you mean that tokenizer_threshold.tokenize should output these tokens?

['目前■', ',■', '美国■', '进■', '出■', '口■', '分别■', '占■', 'GDP■', '的■', '15■', '%■', '和■', '1■', '2', '%', '。']

(where only 出口■ is further split into 出■ 口■)

If yes, what is the format of the vocabulary file vocab_of_tokenized_file? In the latest released version, the frequencies should be separated by a space.
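
For illustration, a vocabulary file in this format has one "<token> <frequency>" pair per line, separated by a single space. A small sketch that writes such a file (the tokens and counts below are made up, not taken from the corpus in the issue):

# Hypothetical example of the expected vocabulary file format:
# one "<token> <frequency>" pair per line, space-separated.
with open("vocab_example", "w", encoding="utf-8") as f:
    f.write("目前■ 215\n")
    f.write("GDP■ 42\n")
    f.write("出口■ 7\n")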

Zenglinxiao commented 3 years ago

Hi,

> Just to make sure I understood correctly: do you mean that tokenizer_threshold.tokenize should output these tokens?
>
> ['目前■', ',■', '美国■', '进■', '出■', '口■', '分别■', '占■', 'GDP■', '的■', '15■', '%■', '和■', '1■', '2', '%', '。']
>
> (where only 出口■ is further split into 出■ 口■)

Yes, exactly.

> If yes, what is the format of the vocabulary file vocab_of_tokenized_file? In the latest released version, the frequencies should be separated by a space.

vocab_of_tokenized_file is generated with get_vocab.py from subword_nmt, so the format is <token><space><frequency>.

I can confirm the vocabulary loading is correct: I also ran this on the English side with the same options but without pre-segmentation, and its output is as expected.
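
For context, get_vocab.py from subword_nmt essentially counts whitespace-separated tokens in the tokenized file and writes them out with their counts. A rough Python equivalent, as a sketch rather than the actual script:

# Rough approximation of subword_nmt's get_vocab.py: count whitespace-separated
# tokens and write "<token> <count>" lines, most frequent first.
from collections import Counter

def get_vocab(tokenized_path, vocab_path):
    counts = Counter()
    with open(tokenized_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    with open(vocab_path, "w", encoding="utf-8") as f:
        for token, count in counts.most_common():
            f.write("%s %d\n" % (token, count))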

guillaumekln commented 3 years ago

Thanks for reporting. I confirm this is a bug.