OpenNMT / Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support
https://opennmt.net/
MIT License

vocabulary_path not working properly with support_prior_joiners? #177

Closed: Zenglinxiao closed this issue 3 years ago

Zenglinxiao commented 3 years ago

Hello Guillaume, I ran into an issue when using vocabulary_path. Normally, with vocabulary_path, we would expect the tokenized output not to contain, as separate tokens, vocabulary entries whose frequency is below the given threshold. But once this feature is used on pre-segmented text together with support_prior_joiners, this is not guaranteed. Example:

>>> tokenizer = pyonmttok.Tokenizer("space", joiner_annotate=True, support_prior_joiners=True, bpe_model_path="bpe.code")
>>> tokenizer.tokenize_file("corpora", "corpora.tok")
>>> get_vocab("corpora.tok", "vocab_of_tokenized_file")
>>> tokenizer_threshold = pyonmttok.Tokenizer("space", joiner_annotate=True, support_prior_joiners=True, bpe_model_path="bpe.code", vocabulary_path="vocab_of_tokenized_file", vocabulary_threshold=13)
>>> tokenizer.tokenize("目前■ ,■ 美国■ 进出口■ 分别■ 占■ GDP■ 的■ 15%■ 和■ 12 % 。")[0]
['目前■', ',■', '美国■', '进■', '出口■', '分别■', '占■', 'GDP■', '的■', '15■', '%■', '和■', '1■', '2', '%', '。']
>>> tokenizer_threshold.tokenize("目前■ ,■ 美国■ 进出口■ 分别■ 占■ GDP■ 的■ 15%■ 和■ 12 % 。")[0]
['目■', '前■', ',■', '美■', '国■', '进■', '出■', '口■', '分■', '别■', '占■', 'G■', 'D■', 'P■', '的■', '15■', '%■', '和■', '1■', '2', '%', '。']

Note that only 出口■ has a frequency below 13 and should therefore be unmerged. Furthermore, when the corpus is re-tokenized with a tokenizer using threshold 2, the vocabulary size decreases from 3700 to 2221, while there are 3041 tokens with frequency > 2 in the original vocabulary. Any idea?
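
For reference, a quick way to check which of the tokens above actually fall below the threshold is to read the vocabulary file directly. This is a minimal sketch, not part of the original report; it assumes the "<token> <frequency>" format produced by get_vocab and the file name used in the repro:

# Sketch: load the "<token> <frequency>" vocabulary and flag which of the
# example's tokens have a frequency below the threshold of 13.
counts = {}
with open("vocab_of_tokenized_file", encoding="utf-8") as f:
    for line in f:
        token, freq = line.rstrip("\n").rsplit(" ", 1)
        counts[token] = int(freq)

tokens = ['目前■', ',■', '美国■', '进■', '出口■', '分别■', '占■', 'GDP■',
          '的■', '15■', '%■', '和■', '1■', '2', '%', '。']
for token in tokens:
    freq = counts.get(token, 0)
    print(token, freq, "below threshold" if freq < 13 else "")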

guillaumekln commented 3 years ago

Hi,

Just to make sure I understood correctly: do you mean that tokenizer_threshold.tokenize should output these tokens?

['目前■', ',■', '美国■', '进■', '出■', '口■', '分别■', '占■', 'GDP■', '的■', '15■', '%■', '和■', '1■', '2', '%', '。']

(where only 出口■ is further split into 出■ 口■)

If yes, what is the format of the vocabulary file vocab_of_tokenized_file? In the latest released version, the frequencies should be separated by a space.
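
For illustration, a vocabulary file in this format has one "<token> <frequency>" pair per line, separated by a single space. A small sketch that writes such a file (the tokens and counts below are made up, not taken from the corpus in the issue):

# Hypothetical example of the expected vocabulary file format:
# one "<token> <frequency>" pair per line, space-separated.
with open("vocab_example", "w", encoding="utf-8") as f:
    f.write("目前■ 215\n")
    f.write("GDP■ 42\n")
    f.write("出口■ 7\n")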

Zenglinxiao commented 3 years ago

Hi,

> Just to make sure I understood correctly: do you mean that tokenizer_threshold.tokenize should output these tokens?
>
> ['目前■', ',■', '美国■', '进■', '出■', '口■', '分别■', '占■', 'GDP■', '的■', '15■', '%■', '和■', '1■', '2', '%', '。']
>
> (where only 出口■ is further split into 出■ 口■)

Yes, exactly.

> If yes, what is the format of the vocabulary file vocab_of_tokenized_file? In the latest released version, the frequencies should be separated by a space.

vocab_of_tokenized_file is generated with get_vocab.py from subword_nmt, so the format is <token><space><frequency>.

I can confirm the vocabulary loading is correct: I also ran this on the English side with the same options but without pre-segmentation, and its output is as expected.
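
For context, get_vocab.py from subword_nmt essentially counts whitespace-separated tokens in the tokenized file and writes them out with their counts. A rough Python equivalent, as a sketch rather than the actual script:

# Rough approximation of subword_nmt's get_vocab.py: count whitespace-separated
# tokens and write "<token> <count>" lines, most frequent first.
from collections import Counter

def get_vocab(tokenized_path, vocab_path):
    counts = Counter()
    with open(tokenized_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    with open(vocab_path, "w", encoding="utf-8") as f:
        for token, count in counts.most_common():
            f.write("%s %d\n" % (token, count))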

guillaumekln commented 3 years ago

Thanks for reporting. I confirm this is a bug.