The experiment corpora are the same as in #1. For efficiency, BPE requires the input to be pre-tokenized: we use the Tokenizer for EN, and pkuseg + Tokenizer for ZH.
non-shared 32k-merge BPE: resulting vocabulary is 41k on the EN side and 65k on the ZH side; cutting entries with a frequency lower than 10 gives 33k EN / 47k ZH.
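As a point of reference, a minimal sketch of this pre-tokenization + non-shared BPE setup with pkuseg and pyonmttok; the file names, the "aggressive" mode and the joiner_annotate option are assumptions, not the exact settings used here.

```python
# Sketch only: paths, mode and joiner option are assumed, not taken from this issue.
import pkuseg
import pyonmttok

# ZH: word segmentation with pkuseg before anything else.
seg = pkuseg.pkuseg()
with open("train.zh") as fin, open("train.seg.zh", "w") as fout:
    for line in fin:
        fout.write(" ".join(seg.cut(line.strip())) + "\n")

# Both sides then go through the OpenNMT Tokenizer before learning BPE.
base_tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# Non-shared: learn one 32k-merge BPE model per side.
for lang, path in [("en", "train.en"), ("zh", "train.seg.zh")]:
    learner = pyonmttok.BPELearner(tokenizer=base_tokenizer, symbols=32000)
    learner.ingest_file(path)
    learner.learn("bpe-32k." + lang)
```
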
step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
---|---|---|---|---|
5000 | 26.64 | 26.89 | 25.16 | 29.72 |
10000 | 29.68 | 29.67 | 29.1 | 32.26 |
15000 | 30.64 | 30.57 | 30.3 | 33.46 |
20000 | 31.25 | 30.9 | 30.87 | 34.43 |
25000 | 31.2 | 31.11 | 31.13 | 34.86 |
30000 | 31.2 | 31.39 | 31.06 | 35.1 |
35000 | 31.37 | 31.47 | 30.83 | 35.21 |
40000 | 31.3 | 31.45 | 31.08 | 35.2 |
45000 | 31.32 | 31.43 | 31.11 | 35.28 |
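The frequency cut used above (and in the runs below) boils down to counting subword tokens in the BPE-tokenized training data and dropping entries seen fewer than 10 times. A minimal sketch, with placeholder paths:

```python
# Sketch of the "cut vocab with frequency lower than 10" step; paths are placeholders.
from collections import Counter

def build_vocab(tokenized_path, min_frequency=10):
    counts = Counter()
    with open(tokenized_path) as f:
        for line in f:
            counts.update(line.split())
    # keep only subwords seen at least `min_frequency` times
    return {tok: c for tok, c in counts.items() if c >= min_frequency}

vocab_en = build_vocab("train.bpe.en")  # ~41k -> ~33k entries after the cut
vocab_zh = build_vocab("train.bpe.zh")  # ~65k -> ~47k entries after the cut
```
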
shared 32k-merge BPE: resulting vocabulary is 65k (27k EN / 55k ZH); cutting entries with a frequency lower than 10 gives 49k (21k / 34k), which we then cut to 48k in onmt.
step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
---|---|---|---|---|
5000 | 24.97 | 24.93 | 23.43 | 27.88 |
10000 | 28.66 | 28.69 | 28.37 | 31.45 |
15000 | 29.83 | 29.93 | 29.88 | 32.68 |
20000 | 30.37 | 30.59 | 30.62 | 33.51 |
25000 | 30.69 | 30.87 | 30.86 | 33.96 |
30000 | 30.92 | 30.94 | 31.01 | 34.34 |
35000 | 30.94 | 31.25 | 31.26 | 34.38 |
40000 | 31.09 | 31.19 | 31.42 | 34.72 |
45000 | 31.01 | 31.06 | 31.35 | 34.67 |
50000 | 31.09 | 30.88 | 31.28 | 34.74 |
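The shared variant above differs only in how the codes are learned: a single BPELearner ingests both sides so one joint set of merges is produced. A minimal sketch, under the same assumptions as the non-shared sketch:

```python
# Sketch only: one joint 32k-merge BPE model for EN and (pre-segmented) ZH.
import pyonmttok

base_tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
learner = pyonmttok.BPELearner(tokenizer=base_tokenizer, symbols=32000)
learner.ingest_file("train.en")
learner.ingest_file("train.seg.zh")
shared_tokenizer = learner.learn("bpe-32k.shared")  # tokenizer configured with the joint codes
```
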
shared 48k-merge BPE: resulting vocabulary is 89k (37k EN / 73k ZH); ignoring entries with a frequency lower than 10 gives a 66k merged vocabulary (29k / 44k), which we then cut to 64k within onmt.
step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
---|---|---|---|---|
5000 | 26.33 | 26.48 | 25.21 | 29.17 |
10000 | 29.46 | 29.5 | 29.5 | 32.01 |
15000 | 30.17 | 30.28 | 30.29 | 33.74 |
20000 | 30.64 | 30.94 | 31.15 | 34.4 |
25000 | 31.09 | 31.05 | 31.42 | 34.85 |
30000 | 31.07 | 31.2 | 31.2 | 35.05 |
35000 | 31.25 | 31.14 | 31.11 | 35.09 |
40000 | 31.14 | 31.21 | 31.3 | 35.12 |
45000 | 31.3 | 31.09 | 31.14 | 35.2 |
50000 | 31.28 | 31.09 | 31.07 | 35.22 |
shared 32k-merge BPE: resulting vocabulary is 65k (27k EN / 55k ZH). We use the per-side vocabulary as the subword vocabulary of the Tokenizer and set `vocabulary_threshold` to cut entries with a frequency lower than 10 (this should give us a 49k merged vocabulary). We then use this vocabulary-constrained tokenizer to tokenize the train/valid/test data and extract the NN model vocabulary from them: 54k (22k / 39k); cutting freq < 10 gives 49k (21k / 34k), which we then cut to 48k within onmt.
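A sketch of the constrained application step described above, assuming the `vocabulary_path` / `vocabulary_threshold` options of the Tokenizer and placeholder file names (the vocabulary file lists each subword with its frequency):

```python
# Sketch only: apply the shared BPE model restricted to the per-side subword vocabulary,
# so segmentation only produces subwords seen at least 10 times; the NN vocabulary is
# then re-extracted from the re-tokenized files as in the earlier runs.
import pyonmttok

constrained_zh = pyonmttok.Tokenizer(
    "aggressive",
    joiner_annotate=True,
    bpe_model_path="bpe-32k.shared",
    vocabulary_path="vocab.zh",   # per-side subword vocabulary with frequencies
    vocabulary_threshold=10,      # only produce subwords with frequency >= 10
)
constrained_zh.tokenize_file("train.seg.zh", "train.bpe.zh")
```
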
step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
---|---|---|---|---|
5000 | 25.78 | 25.77 | 24.26 | 28.35 |
10000 | 28.91 | 28.83 | 28.3 | 31.74 |
15000 | 29.92 | 29.86 | 29.69 | 32.83 |
20000 | 30.65 | 30.68 | 30.56 | 33.86 |
25000 | 30.95 | 31.08 | 30.8 | 34.3 |
30000 | 31.07 | 31.15 | 30.88 | 34.59 |
35000 | 31.15 | 31.29 | 31.04 | 34.59 |
40000 | 31.26 | 31.62 | 31.09 | 34.91 |
45000 | 31.48 | 31.47 | 31.19 | 34.95 |
50000 | 31.52 | 31.45 | 31.25 | 35.18 |
3 + best practice: shared 48k-merge BPE, resulting vocabulary 89k (37k EN / 73k ZH). We use the per-side vocabulary as the subword vocabulary of the Tokenizer and set `vocabulary_threshold` to 10. Re-applying this restricted tokenization model to the corpora results in a 30k / 49k vocabulary; ignoring entries with a frequency lower than 10 gives a 69k merged vocabulary (29k / 44k), which we then cut to 64k within onmt.
step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
---|---|---|---|---|
5000 | 26.62 | 26.91 | 25.36 | 29.81 |
10000 | 29.46 | 29.56 | 29.69 | 32.48 |
15000 | 30.48 | 30.69 | 30.43 | 33.57 |
20000 | 31.02 | 31.13 | 31.24 | 34.28 |
25000 | 31.11 | 31.29 | 31.31 | 34.85 |
30000 | 31.37 | 31.25 | 31.43 | 35.05 |
35000 | 31.41 | 31.43 | 31.43 | 35.14 |
40000 | 31.48 | 31.56 | 31.27 | 35.10 |
45000 | 31.40 | 31.55 | 31.25 | 35.18 |
50000 | 31.50 | 31.51 | 31.21 | 35.18 |
1 + best practice: re-applying the tokenizer (restricted to vocab with frequency > 10) gives 35k / 52k per side; ignoring freq < 10 results in the 33k / 47k vocabulary used in onmt.
step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
---|---|---|---|---|
5000 | 26.95 | 26.92 | 25.11 | 29.68 |
10000 | 29.84 | 29.51 | 29.13 | 32.5 |
15000 | 30.63 | 30.47 | 30.2 | 33.81 |
20000 | 31.15 | 31.05 | 30.98 | 34.39 |
25000 | 31.26 | 31.27 | 31.05 | 34.82 |
30000 | 31.47 | 31.49 | 31.32 | 34.92 |
35000 | 31.49 | 31.64 | 31.28 | 35.07 |
40000 | 31.56 | 31.55 | 31.37 | 35.23 |
45000 | 31.61 | 31.61 | 31.52 | 35.26 |
50000 | 31.6 | 31.53 | 31.54 | 35.37 |
Rather than segmenting Chinese with pkuseg, in this section we test the performance of segmenting Chinese at the character level.
non-shared: Tokenizer[segment_alphabet 'Han', segment_alphabet_change True] + 32k (EN) / 1k (ZH) BPE merges, resulting vocab [43k/17k], cut freq < 10 -> [33k/12k]

step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
---|---|---|---|---|
5000 | 24.29 | 24.63 | 22.64 | 26.06 |
10000 | 27.92 | 27.92 | 27.23 | 29.87 |
15000 | 29.32 | 28.94 | 28.63 | 31.39 |
20000 | 30.14 | 29.7 | 29.43 | 32.29 |
25000 | 30.67 | 30.27 | 29.91 | 32.73 |
30000 | 30.71 | 30.58 | 30.25 | 33.07 |
35000 | 30.99 | 30.84 | 30.65 | 33.37 |
40000 | 31.11 | 30.81 | 30.59 | 33.56 |
45000 | 31.17 | 31.02 | 30.51 | 33.57 |
50000 | 31.35 | 30.93 | 30.52 | 33.75 |
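A minimal sketch of the character-level setup above, where the Tokenizer itself splits Han characters via `segment_alphabet` / `segment_alphabet_change`; the mode and joiner annotation are assumptions:

```python
# Sketch only: character-level Chinese segmentation handled by the Tokenizer itself.
import pyonmttok

char_tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    joiner_annotate=True,
    segment_alphabet=["Han"],      # split between Han characters
    segment_alphabet_change=True,  # split when the script changes
)
tokens, _ = char_tokenizer.tokenize("深度学习 deep learning")
# Han characters come out as individual tokens, with joiners marking the splits.
```
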
non-shared: Tokenizer for EN, ZH script segmentation (append joiner) + Tokenizer[support_prior_joiner], resulting vocab [41k/15k] cut freq<10 -> [33k/10k]
non-shared: Tokenizer for EN, ZH script segmentation + Tokenizer[support_prior_joiner], resulting vocab [41k/11k], cut freq < 10 -> [33k/7.7k]

step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
---|---|---|---|---|
5000 | 23.96 | 23.88 | 22.43 | 25.76 |
10000 | 27.79 | 27.84 | 27.35 | 29.6 |
15000 | 29.1 | 28.87 | 28.94 | 30.89 |
20000 | 29.72 | 29.6 | 29.58 | 32.01 |
25000 | 30.19 | 30.01 | 30.26 | 32.35 |
30000 | 30.5 | 30.19 | 30.43 | 32.68 |
35000 | 30.65 | 30.51 | 30.67 | 32.98 |
40000 | 30.94 | 30.52 | 30.7 | 33.17 |
45000 | 30.76 | 30.62 | 30.47 | 33.3 |
50000 | 30.85 | 30.67 | 30.64 | 33.38 |
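For the two runs above, the ZH script segmentation is done before the Tokenizer. A rough sketch of what that can look like with the prior-joiner flag; the `segment_han` helper, the Han range check and the joiner handling are illustrative assumptions, not the exact preprocessing used here:

```python
# Sketch only: pre-split Han characters (optionally appending the joiner marker so the
# Tokenizer knows they were contiguous), then tokenize while respecting prior joiners.
import pyonmttok

JOINER = "￭"  # OpenNMT Tokenizer joiner marker

def is_han(ch):
    # rough check: CJK Unified Ideographs block, good enough for this sketch
    return "\u4e00" <= ch <= "\u9fff"

def segment_han(line, append_joiner=True):
    out = []
    chars = list(line.strip())
    for i, ch in enumerate(chars):
        if is_han(ch) and i + 1 < len(chars) and is_han(chars[i + 1]):
            out.append(ch + (JOINER if append_joiner else "") + " ")
        else:
            out.append(ch)
    return "".join(out)

tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    joiner_annotate=True,
    support_prior_joiners=True,  # keep joiners already present in the input
)
tokens, _ = tokenizer.tokenize(segment_han("深度学习很有趣"))
```
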
Conclusion: by not appending the joiner, we reduce the vocabulary size by about 1/3, which is attributable to the vocabulary duplication the joiner causes with `segment_alphabet` of Tokenizer (e.g. 心￭ & 心 are distinct entries), but the performance is also impacted.

shared: ZH script segmentation + Tokenizer[support_prior_joiner], shared vocab cut freq < 10, resulting [41k/34k] -> [33k/16k] -> shared [40k]

step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
---|---|---|---|---|
5000 | 21.37 | 21.48 | 19.94 | 22.99 |
10000 | 25.87 | 25.95 | 25.52 | 28.0 |
15000 | 27.41 | 27.46 | 27.17 | 29.52 |
20000 | 28.4 | 28.25 | 28.27 | 30.74 |
25000 | 29.06 | 28.81 | 28.91 | 31.21 |
30000 | 29.19 | 29.16 | 29.21 | 31.65 |
35000 | 29.37 | 29.35 | 29.45 | 31.84 |
40000 | 29.58 | 29.5 | 29.66 | 31.93 |
45000 | 29.62 | 29.6 | 29.63 | 32.24 |
50000 | 29.63 | 29.78 | 29.69 | 32.42 |
shared: ZH script segmentation + Tokenizer[support_prior_joiner]. Following the subword_nmt best practice, we get the vocab ([41k/34k]), then limit generated subwords to the vocabulary with `vocabulary_threshold` 10 ([33k/16k]), re-apply this constrained Tokenizer to the corpus to get the vocab ([34k/19k]), then cut freq < 10 ([33k/16k]), which is used as the model's vocab (39.7k after sharing).
step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
---|---|---|---|---|
5000 | 21.82 | 21.64 | 20.29 | 23.67 |
10000 | 26.19 | 26.58 | 25.89 | 28.34 |
15000 | 27.68 | 27.74 | 27.53 | 29.91 |
20000 | 28.62 | 28.51 | 28.49 | 31.06 |
25000 | 29.14 | 28.9 | 29.1 | 31.57 |
30000 | 29.36 | 29.0 | 29.28 | 31.86 |
35000 | 29.78 | 29.44 | 29.46 | 32.27 |
40000 | 29.83 | 29.5 | 29.29 | 32.49 |
45000 | 29.85 | 29.64 | 29.4 | 32.64 |
50000 | 30.04 | 29.75 | 29.51 | 32.75 |
Conclusion:
SentencePiece does not split unknown tokens by default, which results in a larger vocabulary in the tokenized corpora. The Tokenizer also does not seem to work very well with the spacer: the more segmentation the Tokenizer makes, the more vocabulary entries are found when extracting the vocabulary from the tokenized corpora, even though there is less duplication. Switching to BPE therefore seems to be a promising workaround, since unknown tokens are split automatically.