Closed: Zenglinxiao closed this issue 3 years ago.
Hi,
Just to make sure I understood correctly: do you mean that tokenizer_threshold.tokenize should output these tokens?
['目前■', ',■', '美国■', '进■', '出■', '口■', '分别■', '占■', 'GDP■', '的■', '15■', '%■', '和■', '1■', '2', '%', '。']
(where only 出口■ is further split into 出■ 口■)
If yes, what is the format of the vocabulary file vocab_of_tokenized_file? In the latest released version, the frequencies should be separated by a space.
> Hi,
> Just to make sure I understood correctly: do you mean that tokenizer_threshold.tokenize should output these tokens?
> ['目前■', ',■', '美国■', '进■', '出■', '口■', '分别■', '占■', 'GDP■', '的■', '15■', '%■', '和■', '1■', '2', '%', '。']
> (where only 出口■ is further split into 出■ 口■)

Yes, exactly.

> If yes, what is the format of the vocabulary file vocab_of_tokenized_file? In the latest released version, the frequencies should be separated by a space.

vocab_of_tokenized_file is obtained with get_vocab.py from subword_nmt, so each line is <token><space><frequency>.
I can confirm the vocabulary loading is correct, as I also ran the same options on the English side without applying pre-segmentation, and its output is just as expected.
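For reference, a minimal sketch of how such a vocabulary file can be produced; the file names are placeholders, and subword_nmt's get_vocab.py does essentially the same frequency count:

```python
from collections import Counter

# Count whitespace-separated tokens in the pre-tokenized corpus
# (equivalent to what subword_nmt's get_vocab.py computes).
counter = Counter()
with open("corpus.tok.zh", encoding="utf-8") as f:
    for line in f:
        counter.update(line.strip().split())

# Write "<token><space><frequency>" lines, most frequent first,
# which is the space-separated format mentioned above.
with open("vocab_of_tokenized_file", "w", encoding="utf-8") as f:
    for token, freq in counter.most_common():
        f.write(f"{token} {freq}\n")
```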
Thanks for reporting. I confirm this is a bug.
Hello Guillaume, I ran into an issue when using vocabulary_path. Normally, with vocabulary_path, we would expect the output sentence not to contain, as a separate token, any vocabulary entry below a certain frequency threshold. But once this feature is used on pre-segmented text together with support_prior_joiners, this cannot be guaranteed. Example:
Notice that only 出口■ is below frequency 13 and should be unmerged. Furthermore, when re-extracting the vocabulary from the corpus tokenized with a threshold of 2, the vocabulary size decreases from 3700 to 2221, while there are 3041 tokens with frequency > 2 in the original vocabulary. Any idea?
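For context, a minimal sketch of the kind of configuration being described, assuming a pyonmttok setup along these lines; the mode, file names, threshold value, and the input line (reconstructed from the expected output quoted above) are placeholders, not the exact ones from this report:

```python
import pyonmttok

# Hypothetical configuration: the BPE model, vocabulary file, and threshold
# are placeholders, not the exact values from this report.
tokenizer_threshold = pyonmttok.Tokenizer(
    "conservative",
    joiner_annotate=True,
    support_prior_joiners=True,  # input already carries ■ joiners from pre-segmentation
    bpe_model_path="zh.bpe",
    vocabulary_path="vocab_of_tokenized_file",  # lines of <token><space><frequency>
    vocabulary_threshold=13,  # subwords rarer than this should be split further
)

# Input reconstructed from the expected output quoted above
# (出口■ merged back into a single token); may differ from the original input.
line = "目前■ ,■ 美国■ 进■ 出口■ 分别■ 占■ GDP■ 的■ 15■ %■ 和■ 1■ 2 % 。"
tokens, _ = tokenizer_threshold.tokenize(line)
print(tokens)
# Expected: only 出口■ (frequency below 13) is split into 出■ 口■.
```

With a frequency threshold set, any produced subword whose frequency in the vocabulary file falls below it should be split further; the report is that this does not hold when support_prior_joiners is enabled on pre-segmented input.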