Zenglinxiao opened this issue 3 years ago
In short, SHOULD:

- `add_dummy_prefix` for EN, no for CJK;
- use the Tokenizer's built-in learning interface (`.ingest`, `.learn`) when combining it with other subword methods;
- `aggressive` mode to fix the vocab duplication issue caused by the subword marker;
- `case_markup` helps in reducing vocab duplication (should be better than training the truecaser/recaser of Moses).

And:

- `segment_alphabet_change` can be used, but its behavior may also be captured by the subword model;
- `preserve_segmented_tokens` is suggested whenever the `segment_*` options are used;
- `spacer_new` seems useless for now.

To evaluate:

- `segment_numbers`

Warning: `spacer_new` has too much impact on the sequence length.
Do we know where the approx. 4.8k remaining duplicates in the vocab come from in EN/aggressive?

The `preserve_segmented_tokens` behavior is probably expected in your case: 'store' is segmented with `segment_alphabet_change`, hence the spacer is separated.

> Do not use a shared SentencePiece model between Latin alphabet languages (EN/FR/DE/...) and Chinese alphabet languages (ZH/JA/...).

What would be the best setup then?
> Warning: `spacer_new` has too much impact on the sequence length.

Yes. And it does not change the vocabulary; all it does is make the spacer a separate character, hence useless.
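For illustration, a minimal sketch comparing the two settings with pyonmttok (the sentence is arbitrary and the exact output is not asserted here):

```python
import pyonmttok

sentence = "Hello World !"

# spacer_annotate only: the spacer ▁ stays attached to the following token.
print(pyonmttok.Tokenizer("conservative", spacer_annotate=True).tokenize(sentence)[0])

# spacer_new: the spacer becomes a standalone token, so the subword vocab is
# unchanged but every space adds one more token to the sequence.
print(pyonmttok.Tokenizer("conservative", spacer_annotate=True, spacer_new=True).tokenize(sentence)[0])
```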
> The `preserve_segmented_tokens` behavior is probably expected in your case: 'store' is segmented with `segment_alphabet_change`, hence the spacer is separated.

According to the option description page, it claims not to attach the joiner/spacer to tokens segmented by a `segment_*` option. In the example I showed above, we actually segment the 的 from `▁store的`; the `▁store` is created/segmented before the `segment_*` step in `.tokenize`. Below is the same example with all `segment_*` options disabled:
```python
>>> pyonmttok.Tokenizer("aggressive", spacer_annotate=True, case_markup=True, soft_case_regions=True, segment_alphabet_change=False, preserve_placeholders=True, preserve_segmented_tokens=False, segment_numbers=False).tokenize("我很喜欢apple store的2020款ipad。")[0]
['我很喜欢apple', '▁store的', '2020', '款ipad', '。']
```
The ideal segmentation should be:

- `▁store的` -> `['▁store', '的']`
- `store的` -> `['store', '■', '的']`
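For reference, a minimal sketch of re-running the same sentence with the `segment_*` options turned back on, to check where the spacer ends up (the output depends on the Tokenizer version, so it is not asserted here):

```python
import pyonmttok

# Same example as above, but with segment_alphabet_change and
# preserve_segmented_tokens enabled.
tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    spacer_annotate=True,
    case_markup=True,
    soft_case_regions=True,
    segment_alphabet_change=True,
    preserve_placeholders=True,
    preserve_segmented_tokens=True,
)
print(tokenizer.tokenize("我很喜欢apple store的2020款ipad。")[0])
```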
> Do not use a shared SentencePiece model between Latin alphabet languages (EN/FR/DE/...) and Chinese alphabet languages (ZH/JA/...).
>
> What would be the best setup then?
At least for the SentencePiece combination:

- `add_dummy_prefix` for spaced languages and no `add_dummy_prefix` for ZH/JA;
- `aggressive` to reduce duplicate vocab, case annotations to fold case;
- `segment_alphabet_change`, `preserve_placeholders`, `segment_numbers`, `preserve_segmented_tokens` (a sketch of such a setup follows this list).
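A minimal sketch of what that setup could look like, assuming pyonmttok and the sentencepiece Python package; the file names (`train.en`, `train.zh`) and vocab sizes are placeholders:

```python
import pyonmttok
import sentencepiece as spm

# Pre-tokenizer applied to both sides before SentencePiece
# (the training files below are assumed to be pre-tokenized with it).
pre_tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    case_markup=True,
    soft_case_regions=True,
    segment_alphabet_change=True,
    preserve_placeholders=True,
    segment_numbers=True,
    preserve_segmented_tokens=True,
)

# Separate SentencePiece models: add_dummy_prefix only for the spaced language.
spm.SentencePieceTrainer.train(
    input="train.en", model_prefix="spm_en", vocab_size=48000,
    model_type="unigram", add_dummy_prefix=True)
spm.SentencePieceTrainer.train(
    input="train.zh", model_prefix="spm_zh", vocab_size=48000,
    model_type="unigram", add_dummy_prefix=False)
```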
> Do we know where the approx. 4.8k remaining duplicates in the vocab come from in EN/aggressive?
4.8k indicates potential duplication; not all of them are duplicates. We count a duplicate whenever both `▁token` and `token` exist, but sometimes `token` is a word that can also appear inside a longer word, and I think the majority of the counted "duplications" fall into this case (a small counting sketch is given after the examples below).
Besides, some patterns do result in real duplication; the most common one is a preceding special character:

- `that` / `▁that`: caused by "(that is not the case today)"
- `international` / `▁international`: "'international board, '"
- `country` / `▁country`: "cross-country"

Other patterns may exist, but none were found so far.
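For context, a duplicate count like the 4.8k above can be estimated with a small script; this is only a sketch, assuming a SentencePiece `.vocab` file (one tab-separated token/score pair per line) such as the hypothetical `spm_en.vocab` from the setup sketch above:

```python
# Count tokens that appear both with and without the SentencePiece spacer.
def count_potential_duplicates(vocab_path):
    with open(vocab_path, encoding="utf-8") as f:
        tokens = {line.split("\t")[0] for line in f}
    return sum(1 for t in tokens if t.startswith("▁") and t[1:] in tokens)

print(count_potential_duplicates("spm_en.vocab"))
```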
Another issue worth mentioning: placeholders will always be a problem and may result in duplicate vocab when they exist in the corpus. Removing all placeholders before `learn_subword` may be beneficial (a stripping sketch is given after the examples below).
Consider the following example:

```
⦅_tgt_zh⦆ 我很喜欢apple store的2020款 ⦅_item_⦆ 。
```

We have `⦅_tgt_zh⦆` and `⦅_item_⦆` in the sentence. If we feed these into `learn_subword`, we may get `▁我` and `▁。` in the subword vocab, and the sentence will then be segmented as:

```
['⦅_tgt_zh⦆', '▁我', '很喜欢', 'apple', '▁store', '的', '2020', '款', '▁', '⦅_item_⦆', '▁。']
```

If we instead remove the placeholders (in their original format, including the surrounding spaces) from the sentence for `learn_subword`:

```
我很喜欢apple store的2020款 。
```

the sentence will be segmented as:

```
['⦅_tgt_zh⦆', '▁', '我', '很喜欢', 'apple', '▁store', '的', '2020', '款', '▁', '⦅_item_⦆', '▁', '。']
```
```
⦅_tgt_zh⦆ I really like apple store's 2020 edition ⦅_item_⦆ .
-> w/ placeholder:  ["⦅_tgt_zh⦆", "▁I", "▁really", "▁like", "▁apple", "▁store", "'", "s", "▁2020", "▁edition", "▁", "⦅_item_⦆", "▁."]
-> w/o placeholder: ["⦅_tgt_zh⦆", "▁I", "▁really", "▁like", "▁apple", "▁store", "'", "s", "▁2020", "▁edition", "▁", "⦅_item_⦆", "▁", "."]
```
Iterating over the corpus tokenized by the model trained without placeholders to build the vocab for the embeddings then no longer introduces the duplicated entries we saw before.
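A minimal sketch of stripping the placeholders before subword learning, assuming the placeholders always use the ⦅...⦆ delimiters as in the examples above:

```python
import re

# Drop every ⦅...⦆ placeholder (and the space preceding it, if any)
# before feeding the corpus to subword learning.
PLACEHOLDER = re.compile(r" ?⦅[^⦆]*⦆")

def strip_placeholders(line):
    return PLACEHOLDER.sub("", line).strip()

print(strip_placeholders("⦅_tgt_zh⦆ 我很喜欢apple store的2020款 ⦅_item_⦆ 。"))
# -> 我很喜欢apple store的2020款 。
```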
Does `segment_numbers` help?

**Transformer based on non-shared pyonmttok(agg) + SentencePiece [48k|48k]**
step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
---|---|---|---|---|
5000 | 26.56 | 26.69 | 24.93 | 29.87 |
10000 | 28.87 | 28.91 | 28.06 | 32.93 |
15000 | 29.88 | 29.82 | 29.2 | 33.99 |
20000 | 30.19 | 30.11 | 29.28 | 34.45 |
25000 | 30.24 | 30.45 | 29.35 | 34.73 |
30000 | 30.48 | 30.64 | 29.45 | 34.92 |
35000 | 30.45 | 30.67 | 29.28 | 35.04 |
40000 | 30.53 | 30.66 | 29.29 | 35.0 |
**Transformer based on non-shared pyonmttok(agg) + SentencePiece [48k|48k] with `segment_numbers`**
step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
---|---|---|---|---|
5000 | 26.26 | 26.61 | 24.84 | 30.13 |
10000 | 29.02 | 29.17 | 28.06 | 33.08 |
15000 | 29.77 | 29.65 | 28.88 | 33.94 |
20000 | 30.39 | 30.1 | 29.48 | 34.4 |
25000 | 30.53 | 30.45 | 29.61 | 34.64 |
30000 | 30.86 | 30.55 | 29.45 | 34.59 |
35000 | 30.69 | 30.74 | 29.36 | 34.81 |
40000 | 30.65 | 30.63 | 29.43 | 34.86 |
Conclusion: on 2 of the 4 test sets a gain of ~0.1 BLEU, on 1 roughly no change (<0.1), on 1 a loss of >0.1. Basically we can use it, but do not expect too much.
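For completeness, the flag is simply toggled on the pre-tokenizer; a minimal sketch on the sentence used earlier in this thread (output not shown, since it depends on the other annotation options):

```python
import pyonmttok

# segment_numbers splits number sequences into single digits before the
# subword model is applied.
tokenizer = pyonmttok.Tokenizer("aggressive", spacer_annotate=True, segment_numbers=True)
print(tokenizer.tokenize("我很喜欢apple store的2020款ipad。")[0])
```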
See also the `split_digits` option offered by the SentencePiece model training options.

**Transformer based on non-shared pyonmttok(none) + SentencePiece [48k|48k]**
step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
---|---|---|---|---|
5000 | 26.81 | 26.99 | 25.49 | 30.05 |
10000 | 29.27 | 29.27 | 28.61 | 33.07 |
15000 | 29.99 | 30.05 | 29.43 | 33.82 |
20000 | 30.54 | 30.43 | 29.93 | 34.46 |
25000 | 30.7 | 30.57 | 30.02 | 34.63 |
30000 | 30.81 | 30.72 | 30.21 | 34.79 |
35000 | 30.68 | 30.74 | 30.06 | 34.73 |
40000 | 30.67 | 30.71 | 29.83 | 34.8 |
**Transformer based on shared pyonmttok(none) + SentencePiece [48k]**
step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
---|---|---|---|---|
5000 | 26.02 | 26.32 | 24.53 | 29.25 |
10000 | 28.87 | 29.24 | 28.54 | 32.62 |
15000 | 29.92 | 29.93 | 29.61 | 33.58 |
20000 | 30.51 | 30.76 | 29.9 | 34.42 |
25000 | 30.52 | 30.85 | 30.15 | 34.61 |
30000 | 30.81 | 31.04 | 30.33 | 34.74 |
35000 | 30.78 | 31.12 | 30.28 | 34.99 |
40000 | 30.85 | 31.02 | 30.19 | 35.09 |
Conclusion: non-shared indeed makes a difference at the beginning, but its advantage quickly vanishes and it is surpassed by the shared version. The shared model gains +0.2~+0.4 BLEU at 40k steps compared to the non-shared one.
Aggressive or None? Looking at the non-shared experiments above (tables 1/2 & 3), on 3 out of 4 test sets the two modes give similar results, while on the remaining one (newstest2019) `none` mode gets a better BLEU score than `aggressive` by a margin of 0.4. Therefore, even though `aggressive` mode reduces vocab duplication, `none` mode is still preferred when using the SentencePiece unigram model.
Conservative or None? The following shows the model trained with conservative mode:

**Shared pyonmttok(conservative) + SentencePiece [48k]**
step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
---|---|---|---|---|
5000 | 25.76 | 26.04 | 24.41 | 29.12 |
10000 | 28.67 | 28.71 | 28.29 | 32.3 |
15000 | 29.43 | 29.69 | 29.08 | 33.36 |
20000 | 30.01 | 30.1 | 29.59 | 34.05 |
25000 | 30.28 | 30.33 | 29.97 | 34.49 |
30000 | 30.27 | 30.58 | 29.99 | 34.69 |
35000 | 30.47 | 30.56 | 29.94 | 34.59 |
40000 | 30.52 | 30.54 | 29.84 | 34.72 |
Conclusion: the shared model with the `none` setting gets better results than `conservative` on all test sets.
Note: SentencePiece also offers the training options `--split_by_unicode_script`, `--split_by_number`, and `--split_by_whitespace`, which serve objectives similar to `conservative`/`aggressive`.

The issue with SentencePiece: it does not handle UNK by default, unlike `subword_nmt`. The solution according to the author (TO BE VERIFIED): use `--byte_fallback` when training the model, which will split UNK into a byte sequence.
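A hedged sketch of how these training flags might be passed through the sentencepiece Python API (the file and model prefix names are placeholders, and whether `byte_fallback` is the right fix is still to be verified, as noted above):

```python
import sentencepiece as spm

# Unigram model with script/number/whitespace splitting handled by
# SentencePiece itself, plus byte fallback so unknown characters become
# byte pieces instead of <unk>.
spm.SentencePieceTrainer.train(
    input="train.txt",
    model_prefix="spm_unigram",
    vocab_size=48000,
    model_type="unigram",
    split_by_unicode_script=True,
    split_by_number=True,
    split_by_whitespace=True,
    split_digits=True,      # similar intent to the Tokenizer's segment_numbers
    byte_fallback=True,     # split UNK into a byte sequence
)
```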
Currently, we use SentencePiece inside the Tokenizer for our models containing ZH/JA, where no space serves as a natural word boundary. The SentencePiece model is applied after the Tokenizer's `none` mode. `none` mode will only split the placeholders `⦅ ⦆` from the sentence and feed the list of resulting phrases into SentencePiece.

**The issue with a joint SentencePiece model for European & CJK languages**
We share src and tgt to train a unified SentencePiece model, which causes a conflict for the option `add_dummy_prefix`, intended to remove vocab duplication:

- EN: `mint ▁, ▁I ▁like ▁mint ▁.` -> `▁mint ▁, ▁I ▁like ▁mint ▁.`, the duplication `mint`/`▁mint` collapses into `▁mint`.
- ZH: `薄荷 , 我 喜欢 薄荷 。` -> `▁薄荷 , 我 喜欢 薄荷 。`, the single form `薄荷` diverges into `薄荷`/`▁薄荷`.

This is a natural conflict between languages with and without spaces, unless a pre-tokenizer (jieba/pkuseg/mecab) is applied to the CJK language, making it also a space-delimited word sequence.

Experiments: Tokenizer mode `null` (`none` without options), SentencePiece type `unigram`, vocab size 48k.

Conclusion: `add_dummy_prefix` for EN, no for CJK.

**The issue of using the Tokenizer as a pre-tokenizer combined with other subword methods**
The sentence is first handled by the Tokenizer, and the resulting tokens (or phrases) are passed to SentencePiece/BPE. Compared with the normal process without the Tokenizer's inner learn method, this gives (str -[Tokenizer]-> list[str] -> SPM/BPE) vs. (str -[pre-tokenizer]-> str -> SPM/BPE).
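A minimal sketch of the first flow using pyonmttok's built-in subword learner; the file names are placeholders and the exact learner arguments may differ between versions:

```python
import pyonmttok

# Pre-tokenization applied on the fly while ingesting the corpus.
pre_tokenizer = pyonmttok.Tokenizer("none")  # or "aggressive", etc.

# The learner ingests raw text, pre-tokenizes it, and trains SentencePiece on
# the resulting token lists (str -[Tokenizer]-> list[str] -> SPM).
learner = pyonmttok.SentencePieceLearner(
    tokenizer=pre_tokenizer, vocab_size=48000, character_coverage=1.0)
learner.ingest_file("train.txt")
tokenizer = learner.learn("spm.model")

print(tokenizer.tokenize("我很喜欢apple store的2020款ipad。")[0])
```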
Example: `我很喜欢Apple的IPad。`

Joined with `' '.join`: `'我很喜欢 ⦅mrk_case_modifier_C⦆ apple 的 ⦅mrk_begin_case_region_U⦆ ip ⦅mrk_end_case_region_U⦆ ad 。'`

Preferred: `中文 EN word中文` -> `['中文', '▁EN', '▁word', '中文']`

If we pre-tokenize the file, concatenate the resulting tokens into new lines, and learn the subword model from that, the procedure differs from calling `.ingest_file` and `.learn` of the Tokenizer. This difference does not influence the result if the subword method is BPE, because BPE always breaks the line into tokens by whitespace and then starts from characters to learn what to join. It is not true for SentencePiece, which escapes whitespace with the space character `▁`, initializes a large vocabulary, and then learns to gradually decrease the vocabulary size.
**The issue with Tokenizer `none` mode**

It does not work with most features, including `case_markup` and `soft_case_regions`, which makes the large amount of vocab duplication an issue.

Experiments comparing different modes:
`add_dummy_prefix` for EN, no for ZH. Other options: `spacer_annotate`, `case_markup`, `soft_case_regions`, `segment_alphabet_change`, `preserve_placeholders`, `preserve_segmented_tokens`.
Vocab duplication comparison:

Vocab difference:

- `segment_alphabet_change` can be useful
- `preserve_segmented_tokens`: not sure, potentially a bug
- `segment_numbers`: may need to see how this impacts model inference

Duplicate vocab:

Vocab difference:

- `spacer_new`: does not impact the subword vocab, only changes the final tokenized output