Zenglinxiao / Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

Tokenizer with BPE #2

Open Zenglinxiao opened 4 years ago

Zenglinxiao commented 4 years ago

SentencePiece does not split unknown tokens by default, which results in a larger vocabulary in the tokenized corpora. The Tokenizer also does not seem to work very well with the spacer: the more segmentation the Tokenizer performs, the more vocabulary entries are found when extracting the vocabulary from the tokenized corpora, even though there is less duplication. Switching to BPE therefore seems to be a promising workaround, since unknown tokens are split into known subwords automatically.
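
As an illustration of that workaround, here is a minimal sketch using the pyonmttok binding of this Tokenizer: a toy BPE model is learned on a couple of lines and then applied to a word it has never seen, which gets split into smaller known pieces instead of surviving as a single rare token. The toy corpus, merge count, and model path are made up, and the options are assumptions rather than the settings used in the experiments below.

```python
import pyonmttok

# Base tokenizer used both to train and to apply the toy BPE model.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# Train a tiny BPE model from a few lines (a real run ingests the full corpus).
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=100)
learner.ingest("the tokenizer splits rare words into smaller subword units")
learner.ingest("unknown tokens are decomposed instead of being kept whole")
bpe = learner.learn("/tmp/toy-bpe.model")

# A word never seen during training is still segmented into known pieces,
# so it does not inflate the vocabulary as a single unknown token would.
tokens, _ = bpe.tokenize("untokenizable")
print(tokens)
```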

Zenglinxiao commented 4 years ago

Experiments with BPE:

The experiment corpora are the same as in #1. For efficiency, BPE requires pre-tokenized input: we use Tokenizer for EN and pkuseg + Tokenizer for ZH.
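
A rough sketch of this pre-tokenization step, assuming pyonmttok and pkuseg with default settings; the file names (train.en, train.tok.en, ...) and the tokenization options are hypothetical, not the exact ones used here.

```python
import pkuseg
import pyonmttok

# EN side: pre-tokenize with the Tokenizer only.
en_tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# ZH side: pkuseg word segmentation first, then the Tokenizer.
zh_segmenter = pkuseg.pkuseg()
zh_tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

def pretokenize_en(line):
    tokens, _ = en_tokenizer.tokenize(line)
    return " ".join(tokens)

def pretokenize_zh(line):
    words = zh_segmenter.cut(line)               # list of Chinese words
    tokens, _ = zh_tokenizer.tokenize(" ".join(words))
    return " ".join(tokens)

# Write the pre-tokenized corpus that BPE is then learned on
# (the ZH side is processed analogously with pretokenize_zh).
with open("train.en") as fin, open("train.tok.en", "w") as fout:
    for line in fin:
        fout.write(pretokenize_en(line.rstrip("\n")) + "\n")
```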

Shared vs. non-shared BPE with the same number of merge operations, vocabulary cut to 48k (a training sketch for both settings is given after the tables below)

  1. non-shared 32k-merge BPE gives a 41k vocabulary on the EN side and 65k on the ZH side; cutting entries with a frequency lower than 10 gives 33k EN / 47k ZH.

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 26.64 | 26.89 | 25.16 | 29.72 |
     | 10000 | 29.68 | 29.67 | 29.1 | 32.26 |
     | 15000 | 30.64 | 30.57 | 30.3 | 33.46 |
     | 20000 | 31.25 | 30.9 | 30.87 | 34.43 |
     | 25000 | 31.2 | 31.11 | 31.13 | 34.86 |
     | 30000 | 31.2 | 31.39 | 31.06 | 35.1 |
     | 35000 | 31.37 | 31.47 | 30.83 | 35.21 |
     | 40000 | 31.3 | 31.45 | 31.08 | 35.2 |
     | 45000 | 31.32 | 31.43 | 31.11 | 35.28 |
  2. shared 32k-merge BPE gives a 65k vocabulary (27k EN / 55k ZH); cutting entries with a frequency lower than 10 gives 49k (21k / 34k), which we then cut to 48k in onmt.

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 24.97 | 24.93 | 23.43 | 27.88 |
     | 10000 | 28.66 | 28.69 | 28.37 | 31.45 |
     | 15000 | 29.83 | 29.93 | 29.88 | 32.68 |
     | 20000 | 30.37 | 30.59 | 30.62 | 33.51 |
     | 25000 | 30.69 | 30.87 | 30.86 | 33.96 |
     | 30000 | 30.92 | 30.94 | 31.01 | 34.34 |
     | 35000 | 30.94 | 31.25 | 31.26 | 34.38 |
     | 40000 | 31.09 | 31.19 | 31.42 | 34.72 |
     | 45000 | 31.01 | 31.06 | 31.35 | 34.67 |
     | 50000 | 31.09 | 30.88 | 31.28 | 34.74 |
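
For reference, a sketch of how the non-shared and shared models above could be trained with pyonmttok's BPELearner; the file paths are hypothetical and the options are assumptions rather than the exact setup of these experiments.

```python
import pyonmttok

base = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# Non-shared: one 32k-merge BPE model per language, learned on its own corpus.
for lang in ("en", "zh"):
    learner = pyonmttok.BPELearner(tokenizer=base, symbols=32000)
    learner.ingest_file(f"train.tok.{lang}")      # pre-tokenized corpus
    learner.learn(f"bpe-32k.{lang}")

# Shared: a single 32k-merge model learned on the concatenation of both sides.
shared_learner = pyonmttok.BPELearner(tokenizer=base, symbols=32000)
shared_learner.ingest_file("train.tok.en")
shared_learner.ingest_file("train.tok.zh")
shared = shared_learner.learn("bpe-32k.shared")

# Apply the learned model to produce the subword corpus for one side.
shared.tokenize_file("train.tok.en", "train.bpe.en", num_threads=4)
```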

Shared vs. non-shared BPE with a similar per-side vocabulary size

  1. shared 48k-merge BPE gives an 89k vocabulary (37k EN / 73k ZH); ignoring entries with a frequency lower than 10 gives a 66k (29k / 44k) merged vocabulary, which we then cut to 64k within onmt.

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 26.33 | 26.48 | 25.21 | 29.17 |
     | 10000 | 29.46 | 29.5 | 29.5 | 32.01 |
     | 15000 | 30.17 | 30.28 | 30.29 | 33.74 |
     | 20000 | 30.64 | 30.94 | 31.15 | 34.4 |
     | 25000 | 31.09 | 31.05 | 31.42 | 34.85 |
     | 30000 | 31.07 | 31.2 | 31.2 | 35.05 |
     | 35000 | 31.25 | 31.14 | 31.11 | 35.09 |
     | 40000 | 31.14 | 31.21 | 31.3 | 35.12 |
     | 45000 | 31.3 | 31.09 | 31.14 | 35.2 |
     | 50000 | 31.28 | 31.09 | 31.07 | 35.22 |

Shared BPE using the subword-nmt best practice (vocabulary restriction)

  1. shared 32k-merge BPE gives a 65k vocabulary (27k EN / 55k ZH). We use the per-side vocabulary as the subword vocabulary of the Tokenizer and set vocabulary_threshold to cut entries with a frequency lower than 10 (this should give a 49k merged vocabulary). We then use this vocabulary-constrained tokenizer to tokenize the train/valid/test data and extract the NN model vocabulary from it, which gives 54k (22k / 39k); cutting freq < 10 gives 49k (21k / 34k), which we then cut to 48k within onmt (see the vocabulary-restriction sketch after the tables below).

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 25.78 | 25.77 | 24.26 | 28.35 |
     | 10000 | 28.91 | 28.83 | 28.3 | 31.74 |
     | 15000 | 29.92 | 29.86 | 29.69 | 32.83 |
     | 20000 | 30.65 | 30.68 | 30.56 | 33.86 |
     | 25000 | 30.95 | 31.08 | 30.8 | 34.3 |
     | 30000 | 31.07 | 31.15 | 30.88 | 34.59 |
     | 35000 | 31.15 | 31.29 | 31.04 | 34.59 |
     | 40000 | 31.26 | 31.62 | 31.09 | 34.91 |
     | 45000 | 31.48 | 31.47 | 31.19 | 34.95 |
     | 50000 | 31.52 | 31.45 | 31.25 | 35.18 |
  2. 3 + best practice: shared 48k-merge BPE gives an 89k vocabulary (37k EN / 73k ZH). We use the per-side vocabulary as the subword vocabulary of the Tokenizer with vocabulary_threshold 10 and re-apply this restricted tokenization model to the corpora, which gives a (30k / 49k) vocabulary; ignoring entries with a frequency less than 10 gives a 69k (29k / 44k) merged vocabulary, which we then cut to 64k within onmt.

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 26.62 | 26.91 | 25.36 | 29.81 |
     | 10000 | 29.46 | 29.56 | 29.69 | 32.48 |
     | 15000 | 30.48 | 30.69 | 30.43 | 33.57 |
     | 20000 | 31.02 | 31.13 | 31.24 | 34.28 |
     | 25000 | 31.11 | 31.29 | 31.31 | 34.85 |
     | 30000 | 31.37 | 31.25 | 31.43 | 35.05 |
     | 35000 | 31.41 | 31.43 | 31.43 | 35.14 |
     | 40000 | 31.48 | 31.56 | 31.27 | 35.10 |
     | 45000 | 31.40 | 31.55 | 31.25 | 35.18 |
     | 50000 | 31.50 | 31.51 | 31.21 | 35.18 |
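
A sketch of the vocabulary-restriction step used in the two runs above, assuming the Tokenizer's vocabulary_path / vocabulary_threshold options and a `token count` vocabulary file; the paths, the counting loop, and the other options are illustrative only.

```python
from collections import Counter
import pyonmttok

# 1) Tokenize with the shared BPE model and count subword frequencies for one side.
bpe = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, bpe_model_path="bpe-32k.shared")

counts = Counter()
with open("train.tok.zh") as f:
    for line in f:
        tokens, _ = bpe.tokenize(line.rstrip("\n"))
        counts.update(tokens)

with open("vocab.zh", "w") as f:
    for token, freq in counts.most_common():
        f.write(f"{token} {freq}\n")

# 2) Re-apply the model restricted to that vocabulary: merges whose result falls
#    below the threshold are reverted, so only subwords seen at least 10 times
#    are produced and rare pieces fall back to smaller, more frequent ones.
restricted = pyonmttok.Tokenizer(
    "aggressive",
    joiner_annotate=True,
    bpe_model_path="bpe-32k.shared",
    vocabulary_path="vocab.zh",
    vocabulary_threshold=10,
)
restricted.tokenize_file("train.tok.zh", "train.bpe.zh", num_threads=4)
```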

Can non-shared BPE also benefit from vocabulary restriction?

  1. 1 + best practice: re-applying the tokenizer (vocabulary restricted to entries with frequency > 10) gives 35k / 52k per side; ignoring freq < 10 results in 33k / 47k for the onmt vocab.

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 26.95 | 26.92 | 25.11 | 29.68 |
     | 10000 | 29.84 | 29.51 | 29.13 | 32.5 |
     | 15000 | 30.63 | 30.47 | 30.2 | 33.81 |
     | 20000 | 31.15 | 31.05 | 30.98 | 34.39 |
     | 25000 | 31.26 | 31.27 | 31.05 | 34.82 |
     | 30000 | 31.47 | 31.49 | 31.32 | 34.92 |
     | 35000 | 31.49 | 31.64 | 31.28 | 35.07 |
     | 40000 | 31.56 | 31.55 | 31.37 | 35.23 |
     | 45000 | 31.61 | 31.61 | 31.52 | 35.26 |
     | 50000 | 31.6 | 31.53 | 31.54 | 35.37 |
Zenglinxiao commented 4 years ago

Segmenting the Chinese alphabet: char-level ZH model

Rather than segmenting Chinese with pkuseg, in this section we test the performance of segmenting Chinese at the character level.
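
Experiment 1 below does this inside the Tokenizer itself; a minimal sketch of that configuration, assuming the segment_alphabet / segment_alphabet_change options named there (the remaining options are guesses, not the exact experimental settings):

```python
import pyonmttok

# Char-level ZH inside the Tokenizer: segment_alphabet=["Han"] splits every Han
# character into its own token, segment_alphabet_change adds a split whenever
# the script changes, and joiner_annotate keeps the segmentation reversible.
zh_char_tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    joiner_annotate=True,
    segment_alphabet=["Han"],
    segment_alphabet_change=True,
)

tokens, _ = zh_char_tokenizer.tokenize("OpenNMT支持中文分词")
print(tokens)
```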

  1. non-shared: Tokenizer[segment_alphabet 'Han', segment_alphabet_change True] + 32k (EN) / 1k (ZH) BPE merges, resulting vocab [43k/17k], cut freq<10 -> [33k/12k]

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 24.29 | 24.63 | 22.64 | 26.06 |
     | 10000 | 27.92 | 27.92 | 27.23 | 29.87 |
     | 15000 | 29.32 | 28.94 | 28.63 | 31.39 |
     | 20000 | 30.14 | 29.7 | 29.43 | 32.29 |
     | 25000 | 30.67 | 30.27 | 29.91 | 32.73 |
     | 30000 | 30.71 | 30.58 | 30.25 | 33.07 |
     | 35000 | 30.99 | 30.84 | 30.65 | 33.37 |
     | 40000 | 31.11 | 30.81 | 30.59 | 33.56 |
     | 45000 | 31.17 | 31.02 | 30.51 | 33.57 |
     | 50000 | 31.35 | 30.93 | 30.52 | 33.75 |
  2. non-shared: Tokenizer for EN, ZH script segmentation (appending a joiner) + Tokenizer[support_prior_joiner], resulting vocab [41k/15k], cut freq<10 -> [33k/10k] (a sketch of this script-segmentation pipeline is given after this list)

  3. non-shared: Tokenizer for EN, ZH script segmentation + Tokenizer[support_prior_joiner], resulting vocab [41k/11k], cut freq<10 -> [33k/7.7k]

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 23.96 | 23.88 | 22.43 | 25.76 |
     | 10000 | 27.79 | 27.84 | 27.35 | 29.6 |
     | 15000 | 29.1 | 28.87 | 28.94 | 30.89 |
     | 20000 | 29.72 | 29.6 | 29.58 | 32.01 |
     | 25000 | 30.19 | 30.01 | 30.26 | 32.35 |
     | 30000 | 30.5 | 30.19 | 30.43 | 32.68 |
     | 35000 | 30.65 | 30.51 | 30.67 | 32.98 |
     | 40000 | 30.94 | 30.52 | 30.7 | 33.17 |
     | 45000 | 30.76 | 30.62 | 30.47 | 33.3 |
     | 50000 | 30.85 | 30.67 | 30.64 | 33.38 |
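
A sketch of the external script-segmentation pipeline of items 2 and 3, assuming the joiner is appended to the left character (the Tokenizer's usual convention) and that support_prior_joiner makes the Tokenizer respect joiners already present in the input; the Han test and the handling of script boundaries are simplified for illustration.

```python
import pyonmttok

JOINER = "￭"  # the Tokenizer's default joiner symbol

def is_han(ch):
    # Basic CJK Unified Ideographs block only; an approximation for illustration.
    return "\u4e00" <= ch <= "\u9fff"

def segment_han(line, append_joiner=True):
    # Split runs of Han characters into single-character tokens. With
    # append_joiner=True (item 2) a joiner is appended to every character that
    # was glued to the next one, so spacing stays recoverable; item 3 drops the
    # joiners, which shrinks the vocabulary but loses that information.
    # (Boundaries between Han and non-Han text are not handled here.)
    out = []
    for i, ch in enumerate(line):
        if is_han(ch):
            nxt = line[i + 1] if i + 1 < len(line) else " "
            joined = append_joiner and is_han(nxt)
            out.append(" " + ch + (JOINER if joined else " "))
        else:
            out.append(ch)
    return " ".join("".join(out).split())

# support_prior_joiner tells the Tokenizer that joiners already present in the
# input mark a prior segmentation, so it tokenizes the rest without crossing them.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, support_prior_joiner=True)
print(tokenizer.tokenize(segment_han("OpenNMT支持中文分词"))[0])
```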

Conclusion: by not appending the joiner, we decrease the vocabulary size by about 1/3, which is attributable to the vocabulary duplication caused by the joiner, but the performance is also impacted.

Will a shared model perform equally well?

  1. shared: ZH script segmentation + Tokenizer[support_prior_joiner], shared vocab cut at freq < 10: [41k/34k] -> [33k/16k] -> shared [40k]

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 21.37 | 21.48 | 19.94 | 22.99 |
     | 10000 | 25.87 | 25.95 | 25.52 | 28.0 |
     | 15000 | 27.41 | 27.46 | 27.17 | 29.52 |
     | 20000 | 28.4 | 28.25 | 28.27 | 30.74 |
     | 25000 | 29.06 | 28.81 | 28.91 | 31.21 |
     | 30000 | 29.19 | 29.16 | 29.21 | 31.65 |
     | 35000 | 29.37 | 29.35 | 29.45 | 31.84 |
     | 40000 | 29.58 | 29.5 | 29.66 | 31.93 |
     | 45000 | 29.62 | 29.6 | 29.63 | 32.24 |
     | 50000 | 29.63 | 29.78 | 29.69 | 32.42 |
  2. shared: ZH script segmentation + Tokenizer[support_prior_joiner]; get the vocabulary ([41k/34k]) and, following the subword_nmt best practice, limit the generated subwords to that vocabulary with vocabulary_threshold 10 ([33k/16k]); re-apply this constrained Tokenizer to the corpus to get the vocabulary ([34k/19k]), then cut freq<10 ([33k/16k]), which is used as the model's vocab (39.7k after sharing).

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 21.82 | 21.64 | 20.29 | 23.67 |
     | 10000 | 26.19 | 26.58 | 25.89 | 28.34 |
     | 15000 | 27.68 | 27.74 | 27.53 | 29.91 |
     | 20000 | 28.62 | 28.51 | 28.49 | 31.06 |
     | 25000 | 29.14 | 28.9 | 29.1 | 31.57 |
     | 30000 | 29.36 | 29.0 | 29.28 | 31.86 |
     | 35000 | 29.78 | 29.44 | 29.46 | 32.27 |
     | 40000 | 29.83 | 29.5 | 29.29 | 32.49 |
     | 45000 | 29.85 | 29.64 | 29.4 | 32.64 |
     | 50000 | 30.04 | 29.75 | 29.51 | 32.75 |

Conclusion:

  1. In general, shared BPE performs consistently worse than non-shared across all test sets. Note that this holds in a setting where Chinese is at the character level (the total number of merge operations is similar: 33+1 for non-shared, 33 for shared): the English side gets the same vocabulary size before/after the minimum-frequency cut, while the Chinese-side vocabulary is slightly bigger, since shared BPE produces the English part of the Chinese text in a more coarse-grained way.
  2. The best practice recommended by subword-nmt indeed works better for a shared BPE model; the gain comes from:
    • Constraining the model to only produce subwords that appear in one side's vocabulary above a certain frequency divides low-frequency merge pairs into smaller pieces. At test time, tokens not included in the vocabulary are broken into smaller pieces that may be better learned by the model (as those smaller pieces may be high frequency).
    • Even though in this experiment the final model vocabulary size did not decrease, we remove less when building the model vocabulary ([41-33|34-16] vs [34-33|19-16]): tokens removed during the model build still exist in the subword sentences, resulting in more mass on the unknown token. If that part is instead limited in the subword model directly, those low-frequency entries are never produced as merges and stay as high-frequency subword sequences, giving less unknown mass and better consistency.
    • Some vocabulary entries (merged pieces) can be frequent on one side but not the other, so this per-side vocabulary constraint can result in two different segmentations of the same word; it therefore behaves somewhat like BPE-dropout and may make the model more robust.