Zenglinxiao / Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support

Tokenizer with BPE #2

Open Zenglinxiao opened 4 years ago

Zenglinxiao commented 4 years ago

SentencePiece does not split unknown tokens by default, which results in a larger vocabulary in the tokenized corpora. The Tokenizer also does not seem to work very well with the spacer: the more segmentation the Tokenizer performs, the more vocabulary entries are found when extracting the vocabulary from the tokenized corpora, even though there is less duplication. Switching to BPE therefore seems to be a promising workaround, since unknown tokens are split into known subwords automatically.
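
As an illustration of that workaround, here is a minimal sketch using the pyonmttok binding of this Tokenizer: a toy BPE model is learned on a couple of lines and then applied to a word it has never seen, which gets split into smaller known pieces instead of surviving as a single rare token. The toy corpus, merge count, and model path are made up, and the options are assumptions rather than the settings used in the experiments below.

```python
import pyonmttok

# Base tokenizer used both to train and to apply the toy BPE model.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# Train a tiny BPE model from a few lines (a real run ingests the full corpus).
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=100)
learner.ingest("the tokenizer splits rare words into smaller subword units")
learner.ingest("unknown tokens are decomposed instead of being kept whole")
bpe = learner.learn("/tmp/toy-bpe.model")

# A word never seen during training is still segmented into known pieces,
# so it does not inflate the vocabulary as a single unknown token would.
tokens, _ = bpe.tokenize("untokenizable")
print(tokens)
```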

Zenglinxiao commented 4 years ago

Experiments with BPE:

The experiment corpora are the same as in #1. For efficiency, BPE requires pre-tokenized input: we use Tokenizer for EN and pkuseg + Tokenizer for ZH.
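
A rough sketch of this pre-tokenization step, assuming pyonmttok and pkuseg with default settings; the file names (train.en, train.tok.en, ...) and the tokenization options are hypothetical, not the exact ones used here.

```python
import pkuseg
import pyonmttok

# EN side: pre-tokenize with the Tokenizer only.
en_tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# ZH side: pkuseg word segmentation first, then the Tokenizer.
zh_segmenter = pkuseg.pkuseg()
zh_tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

def pretokenize_en(line):
    tokens, _ = en_tokenizer.tokenize(line)
    return " ".join(tokens)

def pretokenize_zh(line):
    words = zh_segmenter.cut(line)               # list of Chinese words
    tokens, _ = zh_tokenizer.tokenize(" ".join(words))
    return " ".join(tokens)

# Write the pre-tokenized corpus that BPE is then learned on
# (the ZH side is processed analogously with pretokenize_zh).
with open("train.en") as fin, open("train.tok.en", "w") as fout:
    for line in fin:
        fout.write(pretokenize_en(line.rstrip("\n")) + "\n")
```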

Shared vs. non-shared BPE with the same number of merge operations, vocabulary cut to 48k (a training sketch for both settings is given after the tables below)

  1. non-shared 32k-merge BPE gives a 41k vocabulary on the EN side and 65k on the ZH side; cutting entries with a frequency lower than 10 gives 33k EN / 47k ZH.

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 26.64 | 26.89 | 25.16 | 29.72 |
     | 10000 | 29.68 | 29.67 | 29.1 | 32.26 |
     | 15000 | 30.64 | 30.57 | 30.3 | 33.46 |
     | 20000 | 31.25 | 30.9 | 30.87 | 34.43 |
     | 25000 | 31.2 | 31.11 | 31.13 | 34.86 |
     | 30000 | 31.2 | 31.39 | 31.06 | 35.1 |
     | 35000 | 31.37 | 31.47 | 30.83 | 35.21 |
     | 40000 | 31.3 | 31.45 | 31.08 | 35.2 |
     | 45000 | 31.32 | 31.43 | 31.11 | 35.28 |
  2. shared 32k-merge BPE gives a 65k vocabulary (27k EN / 55k ZH); cutting entries with a frequency lower than 10 gives 49k (21k / 34k), which we then cut to 48k in onmt.

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 24.97 | 24.93 | 23.43 | 27.88 |
     | 10000 | 28.66 | 28.69 | 28.37 | 31.45 |
     | 15000 | 29.83 | 29.93 | 29.88 | 32.68 |
     | 20000 | 30.37 | 30.59 | 30.62 | 33.51 |
     | 25000 | 30.69 | 30.87 | 30.86 | 33.96 |
     | 30000 | 30.92 | 30.94 | 31.01 | 34.34 |
     | 35000 | 30.94 | 31.25 | 31.26 | 34.38 |
     | 40000 | 31.09 | 31.19 | 31.42 | 34.72 |
     | 45000 | 31.01 | 31.06 | 31.35 | 34.67 |
     | 50000 | 31.09 | 30.88 | 31.28 | 34.74 |
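
For reference, a sketch of how the non-shared and shared models above could be trained with pyonmttok's BPELearner; the file paths are hypothetical and the options are assumptions rather than the exact setup of these experiments.

```python
import pyonmttok

base = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# Non-shared: one 32k-merge BPE model per language, learned on its own corpus.
for lang in ("en", "zh"):
    learner = pyonmttok.BPELearner(tokenizer=base, symbols=32000)
    learner.ingest_file(f"train.tok.{lang}")      # pre-tokenized corpus
    learner.learn(f"bpe-32k.{lang}")

# Shared: a single 32k-merge model learned on the concatenation of both sides.
shared_learner = pyonmttok.BPELearner(tokenizer=base, symbols=32000)
shared_learner.ingest_file("train.tok.en")
shared_learner.ingest_file("train.tok.zh")
shared = shared_learner.learn("bpe-32k.shared")

# Apply the learned model to produce the subword corpus for one side.
shared.tokenize_file("train.tok.en", "train.bpe.en", num_threads=4)
```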

Shared vs. non-shared BPE with a similar per-side vocabulary size

  1. shared 48k-merge BPE gives an 89k vocabulary (37k EN / 73k ZH); ignoring entries with a frequency lower than 10 gives a 66k (29k / 44k) merged vocabulary, which we then cut to 64k within onmt.

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 26.33 | 26.48 | 25.21 | 29.17 |
     | 10000 | 29.46 | 29.5 | 29.5 | 32.01 |
     | 15000 | 30.17 | 30.28 | 30.29 | 33.74 |
     | 20000 | 30.64 | 30.94 | 31.15 | 34.4 |
     | 25000 | 31.09 | 31.05 | 31.42 | 34.85 |
     | 30000 | 31.07 | 31.2 | 31.2 | 35.05 |
     | 35000 | 31.25 | 31.14 | 31.11 | 35.09 |
     | 40000 | 31.14 | 31.21 | 31.3 | 35.12 |
     | 45000 | 31.3 | 31.09 | 31.14 | 35.2 |
     | 50000 | 31.28 | 31.09 | 31.07 | 35.22 |

Shared BPE using the subword-nmt best practice (vocabulary restriction)

  1. shared 32k-merge BPE gives a 65k vocabulary (27k EN / 55k ZH). We use the per-side vocabulary as the subword vocabulary of the Tokenizer and set vocabulary_threshold to cut entries with a frequency lower than 10 (this should give a 49k merged vocabulary). We then use this vocabulary-constrained tokenizer to tokenize the train/valid/test data and extract the NN model vocabulary from it, which gives 54k (22k / 39k); cutting freq < 10 gives 49k (21k / 34k), which we then cut to 48k within onmt (see the vocabulary-restriction sketch after the tables below).

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 25.78 | 25.77 | 24.26 | 28.35 |
     | 10000 | 28.91 | 28.83 | 28.3 | 31.74 |
     | 15000 | 29.92 | 29.86 | 29.69 | 32.83 |
     | 20000 | 30.65 | 30.68 | 30.56 | 33.86 |
     | 25000 | 30.95 | 31.08 | 30.8 | 34.3 |
     | 30000 | 31.07 | 31.15 | 30.88 | 34.59 |
     | 35000 | 31.15 | 31.29 | 31.04 | 34.59 |
     | 40000 | 31.26 | 31.62 | 31.09 | 34.91 |
     | 45000 | 31.48 | 31.47 | 31.19 | 34.95 |
     | 50000 | 31.52 | 31.45 | 31.25 | 35.18 |
  2. 3 + best practice: shared 48k-merge BPE gives an 89k vocabulary (37k EN / 73k ZH). We use the per-side vocabulary as the subword vocabulary of the Tokenizer with vocabulary_threshold 10 and re-apply this restricted tokenization model to the corpora, which gives a (30k / 49k) vocabulary; ignoring entries with a frequency less than 10 gives a 69k (29k / 44k) merged vocabulary, which we then cut to 64k within onmt.

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 26.62 | 26.91 | 25.36 | 29.81 |
     | 10000 | 29.46 | 29.56 | 29.69 | 32.48 |
     | 15000 | 30.48 | 30.69 | 30.43 | 33.57 |
     | 20000 | 31.02 | 31.13 | 31.24 | 34.28 |
     | 25000 | 31.11 | 31.29 | 31.31 | 34.85 |
     | 30000 | 31.37 | 31.25 | 31.43 | 35.05 |
     | 35000 | 31.41 | 31.43 | 31.43 | 35.14 |
     | 40000 | 31.48 | 31.56 | 31.27 | 35.10 |
     | 45000 | 31.40 | 31.55 | 31.25 | 35.18 |
     | 50000 | 31.50 | 31.51 | 31.21 | 35.18 |
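
A sketch of the vocabulary-restriction step used in the two runs above, assuming the Tokenizer's vocabulary_path / vocabulary_threshold options and a `token count` vocabulary file; the paths, the counting loop, and the other options are illustrative only.

```python
from collections import Counter
import pyonmttok

# 1) Tokenize with the shared BPE model and count subword frequencies for one side.
bpe = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, bpe_model_path="bpe-32k.shared")

counts = Counter()
with open("train.tok.zh") as f:
    for line in f:
        tokens, _ = bpe.tokenize(line.rstrip("\n"))
        counts.update(tokens)

with open("vocab.zh", "w") as f:
    for token, freq in counts.most_common():
        f.write(f"{token} {freq}\n")

# 2) Re-apply the model restricted to that vocabulary: merges whose result falls
#    below the threshold are reverted, so only subwords seen at least 10 times
#    are produced and rare pieces fall back to smaller, more frequent ones.
restricted = pyonmttok.Tokenizer(
    "aggressive",
    joiner_annotate=True,
    bpe_model_path="bpe-32k.shared",
    vocabulary_path="vocab.zh",
    vocabulary_threshold=10,
)
restricted.tokenize_file("train.tok.zh", "train.bpe.zh", num_threads=4)
```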

Can non-shared BPE also benefit from vocabulary restriction?

  1. 1 + best practice: re-applying the tokenizer (vocabulary restricted to entries with frequency > 10) gives 35k / 52k per side; ignoring freq < 10 results in 33k / 47k for the onmt vocab.

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 26.95 | 26.92 | 25.11 | 29.68 |
     | 10000 | 29.84 | 29.51 | 29.13 | 32.5 |
     | 15000 | 30.63 | 30.47 | 30.2 | 33.81 |
     | 20000 | 31.15 | 31.05 | 30.98 | 34.39 |
     | 25000 | 31.26 | 31.27 | 31.05 | 34.82 |
     | 30000 | 31.47 | 31.49 | 31.32 | 34.92 |
     | 35000 | 31.49 | 31.64 | 31.28 | 35.07 |
     | 40000 | 31.56 | 31.55 | 31.37 | 35.23 |
     | 45000 | 31.61 | 31.61 | 31.52 | 35.26 |
     | 50000 | 31.6 | 31.53 | 31.54 | 35.37 |
Zenglinxiao commented 4 years ago

Segmenting the Chinese alphabet: char-level ZH model

Rather than segmenting Chinese with pkuseg, in this section we test the performance of segmenting Chinese at the character level.
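
Experiment 1 below does this inside the Tokenizer itself; a minimal sketch of that configuration, assuming the segment_alphabet / segment_alphabet_change options named there (the remaining options are guesses, not the exact experimental settings):

```python
import pyonmttok

# Char-level ZH inside the Tokenizer: segment_alphabet=["Han"] splits every Han
# character into its own token, segment_alphabet_change adds a split whenever
# the script changes, and joiner_annotate keeps the segmentation reversible.
zh_char_tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    joiner_annotate=True,
    segment_alphabet=["Han"],
    segment_alphabet_change=True,
)

tokens, _ = zh_char_tokenizer.tokenize("OpenNMT支持中文分词")
print(tokens)
```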

  1. non-shared: Tokenizer[segment_alphabet 'Han', segment_alphabet_change True] + 32k (EN) / 1k (ZH) BPE merges, resulting vocab [43k/17k], cut freq<10 -> [33k/12k]

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 24.29 | 24.63 | 22.64 | 26.06 |
     | 10000 | 27.92 | 27.92 | 27.23 | 29.87 |
     | 15000 | 29.32 | 28.94 | 28.63 | 31.39 |
     | 20000 | 30.14 | 29.7 | 29.43 | 32.29 |
     | 25000 | 30.67 | 30.27 | 29.91 | 32.73 |
     | 30000 | 30.71 | 30.58 | 30.25 | 33.07 |
     | 35000 | 30.99 | 30.84 | 30.65 | 33.37 |
     | 40000 | 31.11 | 30.81 | 30.59 | 33.56 |
     | 45000 | 31.17 | 31.02 | 30.51 | 33.57 |
     | 50000 | 31.35 | 30.93 | 30.52 | 33.75 |
  2. non-shared: Tokenizer for EN, ZH script segmentation (appending a joiner) + Tokenizer[support_prior_joiner], resulting vocab [41k/15k], cut freq<10 -> [33k/10k] (a sketch of this script-segmentation pipeline is given after this list)

  3. non-shared: Tokenizer for EN, ZH script segmentation + Tokenizer[support_prior_joiner], resulting vocab [41k/11k], cut freq<10 -> [33k/7.7k]

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 23.96 | 23.88 | 22.43 | 25.76 |
     | 10000 | 27.79 | 27.84 | 27.35 | 29.6 |
     | 15000 | 29.1 | 28.87 | 28.94 | 30.89 |
     | 20000 | 29.72 | 29.6 | 29.58 | 32.01 |
     | 25000 | 30.19 | 30.01 | 30.26 | 32.35 |
     | 30000 | 30.5 | 30.19 | 30.43 | 32.68 |
     | 35000 | 30.65 | 30.51 | 30.67 | 32.98 |
     | 40000 | 30.94 | 30.52 | 30.7 | 33.17 |
     | 45000 | 30.76 | 30.62 | 30.47 | 33.3 |
     | 50000 | 30.85 | 30.67 | 30.64 | 33.38 |
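
A sketch of the external script-segmentation pipeline of items 2 and 3, assuming the joiner is appended to the left character (the Tokenizer's usual convention) and that support_prior_joiner makes the Tokenizer respect joiners already present in the input; the Han test and the handling of script boundaries are simplified for illustration.

```python
import pyonmttok

JOINER = "￭"  # the Tokenizer's default joiner symbol

def is_han(ch):
    # Basic CJK Unified Ideographs block only; an approximation for illustration.
    return "\u4e00" <= ch <= "\u9fff"

def segment_han(line, append_joiner=True):
    # Split runs of Han characters into single-character tokens. With
    # append_joiner=True (item 2) a joiner is appended to every character that
    # was glued to the next one, so spacing stays recoverable; item 3 drops the
    # joiners, which shrinks the vocabulary but loses that information.
    # (Boundaries between Han and non-Han text are not handled here.)
    out = []
    for i, ch in enumerate(line):
        if is_han(ch):
            nxt = line[i + 1] if i + 1 < len(line) else " "
            joined = append_joiner and is_han(nxt)
            out.append(" " + ch + (JOINER if joined else " "))
        else:
            out.append(ch)
    return " ".join("".join(out).split())

# support_prior_joiner tells the Tokenizer that joiners already present in the
# input mark a prior segmentation, so it tokenizes the rest without crossing them.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, support_prior_joiner=True)
print(tokenizer.tokenize(segment_han("OpenNMT支持中文分词"))[0])
```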

Conclusion: by not appending the joiner, we decrease the vocabulary size by about 1/3, which is attributable to the vocabulary duplication caused by the joiner, but the performance is also impacted.

Will a shared model perform equally well?

  1. shared: ZH script segmentation + Tokenizer[support_prior_joiner], shared vocab cut at freq < 10: [41k/34k] -> [33k/16k] -> shared [40k]

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 21.37 | 21.48 | 19.94 | 22.99 |
     | 10000 | 25.87 | 25.95 | 25.52 | 28.0 |
     | 15000 | 27.41 | 27.46 | 27.17 | 29.52 |
     | 20000 | 28.4 | 28.25 | 28.27 | 30.74 |
     | 25000 | 29.06 | 28.81 | 28.91 | 31.21 |
     | 30000 | 29.19 | 29.16 | 29.21 | 31.65 |
     | 35000 | 29.37 | 29.35 | 29.45 | 31.84 |
     | 40000 | 29.58 | 29.5 | 29.66 | 31.93 |
     | 45000 | 29.62 | 29.6 | 29.63 | 32.24 |
     | 50000 | 29.63 | 29.78 | 29.69 | 32.42 |
  2. shared: ZH script segmentation + Tokenizer[support_prior_joiner]; get the vocabulary ([41k/34k]) and, following the subword_nmt best practice, limit the generated subwords to that vocabulary with vocabulary_threshold 10 ([33k/16k]); re-apply this constrained Tokenizer to the corpus to get the vocabulary ([34k/19k]), then cut freq<10 ([33k/16k]), which is used as the model's vocab (39.7k after sharing).

     | step | newsdev2017-enzh | newstest2017-enzh | newstest2019-enzh | newstest2020-zhen |
     |---|---|---|---|---|
     | 5000 | 21.82 | 21.64 | 20.29 | 23.67 |
     | 10000 | 26.19 | 26.58 | 25.89 | 28.34 |
     | 15000 | 27.68 | 27.74 | 27.53 | 29.91 |
     | 20000 | 28.62 | 28.51 | 28.49 | 31.06 |
     | 25000 | 29.14 | 28.9 | 29.1 | 31.57 |
     | 30000 | 29.36 | 29.0 | 29.28 | 31.86 |
     | 35000 | 29.78 | 29.44 | 29.46 | 32.27 |
     | 40000 | 29.83 | 29.5 | 29.29 | 32.49 |
     | 45000 | 29.85 | 29.64 | 29.4 | 32.64 |
     | 50000 | 30.04 | 29.75 | 29.51 | 32.75 |

Conclusion:

  1. In general, shared BPE performs consistently worse than non-shared across all test sets. Note that this holds in a setting where Chinese is at the character level (the total number of merge operations is similar: 33+1 for non-shared, 33 for shared): the English side gets the same vocabulary size before/after the minimum-frequency cut, while the Chinese-side vocabulary is slightly bigger, since shared BPE produces the English part of the Chinese text in a more coarse-grained way.
  2. The best practice recommended by subword-nmt indeed works better for a shared BPE model; the gain comes from:
    • Constraining the model to only produce subwords that appear in one side's vocabulary above a certain frequency divides low-frequency merge pairs into smaller pieces. At test time, tokens not included in the vocabulary are broken into smaller pieces that may be better learned by the model (as those smaller pieces may be high frequency).
    • Even though in this experiment the final model vocabulary size did not decrease, we remove less when building the model vocabulary ([41-33|34-16] vs [34-33|19-16]): tokens removed during the model build still exist in the subword sentences, resulting in more mass on the unknown token. If that part is instead limited in the subword model directly, those low-frequency entries are never produced as merges and stay as high-frequency subword sequences, giving less unknown mass and better consistency.
    • Some vocabulary entries (merged pieces) can be frequent on one side but not the other, so this per-side vocabulary constraint can result in two different segmentations of the same word; it therefore behaves somewhat like BPE-dropout and may make the model more robust.