Zenglinxiao / Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support
https://opennmt.net/
MIT License

Tokenizer with SentencePiece #1

Open Zenglinxiao opened 3 years ago

Zenglinxiao commented 3 years ago

Currently, we use SentencePiece in Tokenizer for our models that contain ZH/JA, where no space serves as a natural word boundary. The SentencePiece model is applied after Tokenizer's none mode. The none mode only splits placeholders (⦅ ⦆) from the sentence and feeds the list of resulting phrases into SentencePiece.
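
For reference, a minimal sketch of this pipeline with pyonmttok; the model path and the example sentence are only illustrative, not taken from the issue:

import pyonmttok

# "none" mode does no rule-based segmentation: it only isolates placeholders
# (⦅...⦆) and hands the remaining phrases to the SentencePiece model.
tokenizer = pyonmttok.Tokenizer(
    "none",
    sp_model_path="zh_spm.model",  # hypothetical SentencePiece model file
)

tokens, _ = tokenizer.tokenize("我很喜欢⦅_item_⦆的新款产品。")
print(tokens)  # the placeholder stays intact; the rest is split by SentencePiece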

The issue with joint SentencePiece model for European & CJK Languages

We share src and tgt to train a unified SentencePiece model, which causes conflicts for:

The issue of using Tokenizer as pre-tokenizer combined with other subword methods

The sentence is first handled by Tokenizer, and the resulting tokens (or phrases) are passed to SentencePiece/BPE. Compared with the normal process without Tokenizer's inner learn method, the pipeline is (str -[Tokenizer]-> list[str] -> SPM/BPE) vs. (str -[pre-tokenizer]-> str -> SPM/BPE).

If we pre-tokenize the file, write the resulting tokens back as space-joined lines, and learn the subword model from that file, the procedure differs from calling .ingest_file and .learn of Tokenizer's learner. This difference does not influence the result when the subword method is BPE, because BPE always breaks a line into tokens on whitespace and then works on characters to learn what to merge. It is not true for SentencePiece: whitespace is escaped with the spacer character ▁, a large seed vocabulary is initialized, and the vocabulary is then gradually pruned down to the target size. A sketch contrasting the two procedures follows.
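
To make the difference concrete, here is a hedged sketch assuming pyonmttok's learner API (SentencePieceLearner with an optional tokenizer argument and ingest_file/learn methods); file names and the vocabulary size are illustrative:

import pyonmttok

pretok = pyonmttok.Tokenizer("none")

# (a) External pre-tokenization: tokenize, re-join with spaces, write a new
# file, then learn the subword model from that file.
with open("train.txt", encoding="utf-8") as fin, \
        open("train.pretok.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        tokens, _ = pretok.tokenize(line.strip())
        fout.write(" ".join(tokens) + "\n")
learner_a = pyonmttok.SentencePieceLearner(vocab_size=48000)
learner_a.ingest_file("train.pretok.txt")
learner_a.learn("spm_a.model")

# (b) Tokenizer's built-in flow: the learner applies the tokenizer while
# ingesting, so segments are fed as a list of tokens rather than re-joined
# into a single whitespace-separated line.
learner_b = pyonmttok.SentencePieceLearner(tokenizer=pretok, vocab_size=48000)
learner_b.ingest_file("train.txt")
learner_b.learn("spm_b.model")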

The issue with Tokenizer none mode

The none mode does not work with most features, including case_markup and soft_case_regions, which makes the large amount of vocabulary duplication an issue.

Experiments comparing different modes:

segment_alphabet_change can be useful

>>> pyonmttok.Tokenizer("conservative", spacer_annotate=True, case_markup=True, soft_case_regions=True).tokenize("我很喜欢Apple的IPad。")[0]
['我很喜欢', '⦅mrk_case_modifier_C⦆', 'apple的', '⦅mrk_begin_case_region_U⦆', 'ip', '⦅mrk_end_case_region_U⦆', 'ad', '。']

>>> pyonmttok.Tokenizer("conservative", spacer_annotate=True, case_markup=True, soft_case_regions=True, segment_alphabet_change=True).tokenize("我很喜欢Apple的IPad。")[0]
['我很喜欢', '⦅mrk_case_modifier_C⦆', 'apple', '的', '⦅mrk_begin_case_region_U⦆', 'ip', '⦅mrk_end_case_region_U⦆', 'ad', '。']

preserve_segmented_tokens: not sure, potentially a bug

>>> pyonmttok.Tokenizer("aggressive", spacer_annotate=True, case_markup=True, soft_case_regions=True, segment_alphabet_change=True, preserve_placeholders=True, preserve_segmented_tokens=False, segment_numbers=True).tokenize("我很喜欢apple store的2020款ipad。")[0]
['我很喜欢', 'apple', '▁store', '的', '2', '0', '2', '0', '款', 'ipad', '。']
>>> pyonmttok.Tokenizer("aggressive", spacer_annotate=True, case_markup=True, soft_case_regions=True, segment_alphabet_change=True, preserve_placeholders=True, preserve_segmented_tokens=True, segment_numbers=True).tokenize("我很喜欢apple store的2020款ipad。")[0]
['我很喜欢', 'apple', '▁', 'store', '的', '2', '0', '2', '0', '款', 'ipad', '。']

segment_numbers: may need to see how this impacts model inference

spacer_new: does not impact the subword vocab, only changes the final tokenized output (see the sketch below)
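
A quick way to see what spacer_new changes; the example and the expected outputs in the comments follow the documented option behaviour and are not taken from the issue:

import pyonmttok

base = pyonmttok.Tokenizer("conservative", spacer_annotate=True)
detached = pyonmttok.Tokenizer("conservative", spacer_annotate=True, spacer_new=True)

# spacer_new only detaches the spacer into its own token, which lengthens the
# sequence but does not change which subwords can be learned.
print(base.tokenize("Hello World")[0])      # expected: ['Hello', '▁World']
print(detached.tokenize("Hello World")[0])  # expected: ['Hello', '▁', 'World']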

Zenglinxiao commented 3 years ago

In short, we SHOULD:

And:

To evaluate:

francoishernandez commented 3 years ago

Warning: spacer_new has too much impact on the sequence length.

Do we know where the approx. 4.8k remaining duplicates in the vocab come from in EN/aggressive?

preserve_segmented_tokens behavior is probably expected in your case: the 'store' is segmented with segment_alphabet_change, hence the spacer is separated.

Do not use shared SentencePiece model between Latin alphabet languages (EN/FR/DE/...) and Chinese alphabet languages (ZH/JA/...);

What would be the best setup then?

Zenglinxiao commented 3 years ago

Warning: spacer_new has too much impact on the sequence length.

Yes. And it does not change the vocabulary; all it does is make the spacer a separate token. Hence useless.

preserve_segmented_tokens behavior is probably expected in your case: the 'store' is segmented with segment_alphabet_change, hence the spacer is separated.

According to the option description page, it claims not to attach the joiner/spacer to tokens segmented by the segment_* options. In the example I showed above, what actually gets segmented is the 的 from ▁store的; the ▁store itself is created/segmented before the segment_* step in .tokenize. Below is the same example with all segment_* options disabled.

>>> pyonmttok.Tokenizer("aggressive", spacer_annotate=True, case_markup=True, soft_case_regions=True, segment_alphabet_change=False, preserve_placeholders=True, preserve_segmented_tokens=False, segment_numbers=False).tokenize("我很喜欢apple store的2020款ipad。")[0]
['我很喜欢apple', '▁store的', '2020', '款ipad', '。']

The ideal segmentation should be:

Do not use shared SentencePiece model between Latin alphabet languages (EN/FR/DE/...) and Chinese alphabet languages (ZH/JA/...);

What would be the best setup then?

At least for the SentencePiece combination:

Zenglinxiao commented 3 years ago

Do we know where the approx. 4.8k remaining duplicates in the vocab come from in EN/aggressive?

The 4.8k indicates potential duplication; not all of them are true duplicates. We count a duplicate whenever both ▁token and token exist in the vocab, but sometimes token is a standalone word that can also appear as part of another word, and I think the majority of the counted "duplications" are in this case.

Besides, some patterns do result in real duplication; the most common one is preceding special characters:

Other patterns may exist, but none have been found so far. A rough way to count the candidates is sketched below.
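
For reference, a rough script to count such ▁token/token pairs, assuming the standard SentencePiece .vocab format (one token<TAB>score entry per line); the file name joint.vocab is only illustrative:

SPACER = "\u2581"  # the SentencePiece spacer ▁

# Collect all pieces from the vocab file (first tab-separated field per line).
with open("joint.vocab", encoding="utf-8") as f:
    vocab = {line.split("\t")[0] for line in f if line.strip()}

# A "potential duplicate" is a piece that exists both with and without ▁.
duplicates = sorted(t for t in vocab if t.startswith(SPACER) and t[len(SPACER):] in vocab)
print(len(duplicates), "potential duplicates")
print(duplicates[:20])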

Zenglinxiao commented 3 years ago

Another issue worth mentioning is that placeholders will always be a problem and may result in duplicated vocab entries when they exist in the corpus.

Removing all placeholders for learn_subword may be beneficial. Consider the following example: ⦅_tgt_zh⦆ 我很喜欢apple store的2020款 ⦅_item_⦆ 。

We have ⦅_tgt_zh⦆ and ⦅_item_⦆ in the sentence. If we feed it as-is into learn_subword, we may get ▁我 and ▁。 in the subword vocab, and the sentence will then be segmented as: ['⦅_tgt_zh⦆', '▁我', '很喜欢', 'apple', '▁store', '的', '2020', '款', '▁', '⦅_item_⦆', '▁。']

If instead we remove the placeholders (in their original format, including the surrounding space) from the sentence for learn_subword, the training line becomes 我很喜欢apple store的2020款 。 and the sentence will then be segmented as: ['⦅_tgt_zh⦆', '▁', '我', '很喜欢', 'apple', '▁store', '的', '2020', '款', '▁', '⦅_item_⦆', '▁', '。']

⦅_tgt_zh⦆ I really like apple store's 2020 edition ⦅_item_⦆ .
-> with placeholders: ["⦅_tgt_zh⦆", "▁I", "▁really", "▁like", "▁apple", "▁store", "'", "s", "▁2020", "▁edition", "▁", "⦅_item_⦆", "▁."]
-> without placeholders: ["⦅_tgt_zh⦆", "▁I", "▁really", "▁like", "▁apple", "▁store", "'", "s", "▁2020", "▁edition", "▁", "⦅_item_⦆", "▁", "."]

Then, iterating over the corpus tokenized by the model trained without placeholders to build the vocab for embeddings does not introduce the duplicated entries we have seen before. A sketch of this placeholder removal is shown below.
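
A possible sketch of that removal step; the regex, the replacement with a single space, and the file names are assumptions for illustration, not the exact preprocessing used in these experiments:

import re

# Matches a ⦅...⦆ placeholder together with the spaces around it.
PLACEHOLDER = re.compile(r"\s*⦅[^⦆]*⦆\s*")

def strip_placeholders(line: str) -> str:
    # Replace each placeholder with a single space so neighbouring tokens are
    # not glued together, then normalize the remaining whitespace.
    return " ".join(PLACEHOLDER.sub(" ", line).split())

with open("train.txt", encoding="utf-8") as fin, \
        open("train.noph.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(strip_placeholders(line) + "\n")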

Zenglinxiao commented 3 years ago

Update with the results of some experiments (all scores are BLEU):

Does segment_numbers help?

  1. Transformer based on non-shared pyonmttok (aggressive) + SentencePiece [48k|48k]

    step newsdev2017-enzh newstest2017-enzh newstest2019-enzh newstest2020-zhen
    5000 26.56 26.69 24.93 29.87
    10000 28.87 28.91 28.06 32.93
    15000 29.88 29.82 29.2 33.99
    20000 30.19 30.11 29.28 34.45
    25000 30.24 30.45 29.35 34.73
    30000 30.48 30.64 29.45 34.92
    35000 30.45 30.67 29.28 35.04
    40000 30.53 30.66 29.29 35.0
  2. Transformer based on non-shared pyonmttok (aggressive) + SentencePiece [48k|48k] with segment_numbers

    step newsdev2017-enzh newstest2017-enzh newstest2019-enzh newstest2020-zhen
    5000 26.26 26.61 24.84 30.13
    10000 29.02 29.17 28.06 33.08
    15000 29.77 29.65 28.88 33.94
    20000 30.39 30.1 29.48 34.4
    25000 30.53 30.45 29.61 34.64
    30000 30.86 30.55 29.45 34.59
    35000 30.69 30.74 29.36 34.81
    40000 30.65 30.63 29.43 34.86

Conclusion: on 2 of the 4 test sets segment_numbers gains about 0.1 BLEU, on 1 the difference is below 0.1, and on 1 it loses more than 0.1. Basically we can use it, but do not expect too much.

Does the non-shared setup, which is intended to remove vocab duplication, help NMT performance?

  1. Transformer based on non-shared pyonmttok (none) + SentencePiece [48k|48k]

    step newsdev2017-enzh newstest2017-enzh newstest2019-enzh newstest2020-zhen
    5000 26.81 26.99 25.49 30.05
    10000 29.27 29.27 28.61 33.07
    15000 29.99 30.05 29.43 33.82
    20000 30.54 30.43 29.93 34.46
    25000 30.7 30.57 30.02 34.63
    30000 30.81 30.72 30.21 34.79
    35000 30.68 30.74 30.06 34.73
    40000 30.67 30.71 29.83 34.8
  2. Transformer based on shared pyonmttok (none) + SentencePiece [48k]

    step newsdev2017-enzh newstest2017-enzh newstest2019-enzh newstest2020-zhen
    5000 26.02 26.32 24.53 29.25
    10000 28.87 29.24 28.54 32.62
    15000 29.92 29.93 29.61 33.58
    20000 30.51 30.76 29.9 34.42
    25000 30.52 30.85 30.15 34.61
    30000 30.81 31.04 30.33 34.74
    35000 30.78 31.12 30.28 34.99
    40000 30.85 31.02 30.19 35.09

Conclusion: non-shared indeed makes a difference at the beginning, but its advantage vanishes quickly and it is eventually surpassed by the shared version. The shared model gains +0.2 to +0.4 BLEU at 40k steps compared to the non-shared one.

Which mode to use?

  1. Aggressive or None? If we look at the non-shared experiments above (tables 1/2 and 3), on 3 out of 4 test sets the two methods give similar results, while on the remaining one (newstest2019) the none mode gets a better BLEU score than aggressive by a margin of 0.4. Therefore, even though the aggressive mode reduces vocab duplication, the none mode is still preferred when using the SentencePiece Unigram model.

  2. Conservative or None? The following shows the model trained with the conservative mode: shared pyonmttok (conservative) + SentencePiece [48k]

    step newsdev2017-enzh newstest2017-enzh newstest2019-enzh newstest2020-zhen
    5000 25.76 26.04 24.41 29.12
    10000 28.67 28.71 28.29 32.3
    15000 29.43 29.69 29.08 33.36
    20000 30.01 30.1 29.59 34.05
    25000 30.28 30.33 29.97 34.49
    30000 30.27 30.58 29.99 34.69
    35000 30.47 30.56 29.94 34.59
    40000 30.52 30.54 29.84 34.72

Conclusion: the shared model with the none setting gets better results than conservative on all test sets.

  3. Discussion:
    • In SentencePiece, there are default options --split_by_unicode_script, --split_by_number, and --split_by_whitespace which serve objectives similar to the conservative/aggressive modes (see the sketch below).
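
For reference, a hedged sketch of how these trainer options can be set explicitly with the sentencepiece Python package; the flags shown are the ones named above, while the file names and vocab size are illustrative:

import sentencepiece as spm

# The flags below are enabled by default; passing them explicitly just
# documents the behaviour that overlaps with conservative/aggressive.
spm.SentencePieceTrainer.train(
    input="train.txt",             # illustrative training file
    model_prefix="joint48k",       # illustrative output prefix
    vocab_size=48000,
    split_by_unicode_script=True,  # split where the Unicode script changes
    split_by_number=True,          # do not merge digits with other characters
    split_by_whitespace=True,      # never build pieces across whitespace
)
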
Zenglinxiao commented 3 years ago

The issue with SentencePiece: it does not handle UNK by default

The solution to this, according to the author: TO BE VERIFIED