Zenglinxiao opened this issue 4 years ago
Case markup is not really supported for the "none" and "space" tokenization modes. `case_markup` enables `segment_case` to avoid tokens with mixed casing, but the "space" and "none" modes are not allowed to split in the middle of tokens: "space" mode can only split on spaces and "none" mode does not split at all.
Should we just raise an error in this case? Or can you describe what you expected when using `case_markup` with the space/none modes?
I want to use the `case_markup` feature with SentencePiece in order to reduce the vocabulary duplication caused by casing, but I'm not sure what the best practice is. In the `sp_model_path` documentation you mention using mode `none` for SentencePiece, which is how I ran into this issue. Is `none` actually required when using SentencePiece?
Correct me if I'm wrong:
With the original SentencePiece, each sentence is handled by replacing whitespace with the spacer symbol and then running the segmentation.
With the Tokenizer (loading a SentencePiece model) in `none` mode, each sentence is handled by first splitting off placeholders and then feeding the rest of the sentence to SentencePiece for segmentation. If a placeholder is in the middle of the sentence, the original sentence is split into two parts that are fed to SentencePiece separately. In the end, even with `none`, the input to SentencePiece is a list of tokens (I would rather call them "phrases" in this case) instead of a single sentence, so the behavior is not the same as `spm_encode`.
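For illustration, here is a minimal sketch of the comparison I have in mind (the model path and the ⦅PH⦆ placeholder are just examples):

import pyonmttok

# Tokenizer in "none" mode delegating the segmentation to a SentencePiece model.
tokenizer = pyonmttok.Tokenizer("none", sp_model_path="sp.model")

# Without a placeholder, the whole sentence is passed to SentencePiece,
# so the result should match spm_encode.
tokens, _ = tokenizer.tokenize("Hello world, this is a test.")

# With a placeholder in the middle, the sentence is split around it and each
# part is segmented separately, which spm_encode never does.
tokens_ph, _ = tokenizer.tokenize("Hello world ⦅PH⦆ this is a test.")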
I then tried the following experiments:
EX1: use `pyonmttok.Tokenizer(...)` -> `.tokenize_file(corpus_file)` -> train sentencepiece model on this pretokenized corpus
EX2: Initialize `pyonmttok.Tokenizer(...)` as pretokenizer -> `learner(tokenizer, **other_opts)` -> ingest corpus_file -> learn model
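For reference, a rough sketch of the two setups (file names and tokenization options are placeholders):

import pyonmttok

pre_tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# EX1: pretokenize the corpus to a file, then train a SentencePiece model on
# the pretokenized text outside of pyonmttok (e.g. with spm_train).
pre_tokenizer.tokenize_file("corpus.txt", "corpus.tok", num_threads=4)

# EX2: let the SentencePieceLearner drive the ingestion itself.
learner = pyonmttok.SentencePieceLearner(
    tokenizer=pre_tokenizer, vocab_size=32000, character_coverage=1.0
)
learner.ingest_file("corpus.txt")
sp_tokenizer = learner.learn("sp.model")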
These two approaches give different models and vocabularies, which I think is caused by the way the Tokenizer ingests the file: models learned through the Tokenizer see "tokens" ("phrases") rather than full sentences.
This `ingest_tokens` idea apparently causes no issue with BPE, but SentencePiece expects a sentence and does not assume language-dependent logic (treating space as a natural word delimiter is language-dependent, I think), so it may not give the same result as the original SentencePiece.
So, do you have any idea or recommendation on how to correctly use the Tokenizer when working with SentencePiece?
Your understanding is correct.
The behavior is the same as `spm_encode` as long as you don't use placeholders. When you use placeholders, you are using a feature that is specific to the Tokenizer and that SentencePiece has no concept of. At this point we expect users to only use the Tokenizer for both training and applying SentencePiece models.
So the recommendation would simply be: do not use SentencePiece scripts directly.
The use case sounds reasonable. A similar issue came up in https://forum.opennmt.net/t/problem-in-tokenize-and-detokenize-during-translation/3954. The difficulty is that we need to lowercase the phrase before SentencePiece so that different casings result in the same segmentation. We would need to add some code to recover the original casing after applying SentencePiece. I will look into it.
Alternatively, you can try using mode "conservative" or "aggressive". SentencePiece will be used as a subtokenizer like BPE.
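For example, something along these lines (the model path and the extra options are just an illustration):

import pyonmttok

# SentencePiece applied on top of an aggressive pretokenization, so case
# splitting happens before the subword segmentation.
tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    sp_model_path="sp.model",
    spacer_annotate=True,
    case_markup=True,
    soft_case_regions=True,
)
tokens, _ = tokenizer.tokenize("Hello WORLD!")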
I'm also trying to make sentencepiece work with `case_markup`. I got it working somehow by adding the Tokenizer's case placeholders as `user_defined_symbols` in sentencepiece. I still get a few `<unk>`s that I don't get when using sentencepiece in `none` mode, but with this hack they are reduced a lot. Now the question is: should I lowercase the corpus in order to train the sentencepiece model and get a vocabulary with all subwords lowercased?
OK, this actually seems to work. I lowercased the corpus, created a sentencepiece model and vocab with `onmt-build-vocab` with the case placeholders as `user_defined_symbols`, and trained a test model on the raw training files for 5k steps. There are only very few unks, which mostly occur around case splitting (which makes sense) and in uppercase tokens (which I can't explain).
Maybe this could be handled in the code: when sentencepiece is used with a mode other than `none` and with `case_markup`, the case markup placeholders should be predefined as `user_defined_symbols`.
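For reference, roughly what I mean (a sketch using the sentencepiece Python API; the file names are mine and the full list of markers is longer in practice):

import sentencepiece as spm

# Train on the lowercased corpus and reserve the Tokenizer's case placeholders
# as user-defined symbols so SentencePiece keeps them as single tokens.
spm.SentencePieceTrainer.train(
    input="corpus.lower.txt",
    model_prefix="sp",
    vocab_size=32000,
    user_defined_symbols=[
        "⦅mrk_case_modifier_C⦆",
        "⦅mrk_begin_case_region_U⦆",
        "⦅mrk_end_case_region_U⦆",
        # ...plus the remaining case modifier and region markers
    ],
)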
@guillaumekln, could you please clarify what happens, and in what order, when creating a sentencepiece model and vocab with `onmt-build-vocab` with mode `aggressive` and `case_markup`?
> could you please clarify what happens, and in what order, when creating a sentencepiece model and vocab with onmt-build-vocab with mode aggressive and case_markup?
Actually this is not possible with `onmt-build-vocab` from OpenNMT-tf. It always applies a `none` tokenization before training the SentencePiece model. Looks like we need to add some errors when trying to configure a custom tokenization (or add support for it).
To set a different tokenization, you could use the `SentencePieceLearner` directly. Here's what happens when you use a tokenizer with `aggressive` and `case_markup`: the data is pretokenized first, and the SentencePiece model is trained on the resulting tokens.
The insertion of case markup tokens does not happen in this learning phase. They are added during tokenization after applying the SentencePiece model.
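A minimal sketch of that workflow (file names and options are placeholders):

import pyonmttok

# An aggressive pretokenizer with case handling drives the SentencePiece training.
pretok = pyonmttok.Tokenizer("aggressive", case_markup=True, soft_case_regions=True)
learner = pyonmttok.SentencePieceLearner(tokenizer=pretok, vocab_size=32000)
learner.ingest_file("corpus.txt")
tokenizer = learner.learn("sp.model")

# The ⦅mrk_*⦆ tokens only show up here, when tokenizing with the trained
# model, not during the learning phase above.
tokens, _ = tokenizer.tokenize("Hello WORLD")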
> Actually this is not possible with onmt-build-vocab from OpenNMT-tf. It always applies a none tokenization before training the SentencePiece model. Looks like we need to add some errors when trying to configure a custom tokenization (or add support for it).
I see, so essentially it's like running `spm_train` directly, and `onmt-build-vocab` just takes care of converting the vocabulary.
> To set a different tokenization, you could use the SentencePieceLearner directly.
My attempt was to avoid a separate preprocessing step and have everything ready with `onmt-build-vocab` -> train, but this seems necessary.
> The insertion of case markup tokens does not happen in this learning phase. They are added during tokenization after applying the SentencePiece model.
Yes, and if the SentencePiece model already contains the case markup tokens as user-defined symbols, then sentencepiece ignores them when it decodes, so the case can be restored correctly and the translated text seems (mostly) fine. But some inconsistencies remain, due to case splitting that creates tokens/subwords unseen by sentencepiece.
> My attempt was to avoid a separate preprocessing step and have everything ready with onmt-build-vocab -> train, but this seems necessary.
Yes. I added support for pre-tokenization in the PR linked above.
> But some inconsistencies remain, due to case splitting that creates tokens/subwords unseen by sentencepiece.
When using `aggressive` and `case_markup`, case splitting is applied as part of the aggressive tokenization and before SentencePiece. So there should not be unseen tokens in this case.
That's fantastic, thanks!
I thought I should leave some feedback on this:
- Training the sentencepiece model failed with a bad malloc before creating the suffix array, even though there was still RAM available. After I put a limit of 100M sentences (which should be single tokens, actually) I was able to train the model without issues. I suspect I can push this limit to 150-200M.
- I get lots of unks, all of them after punctuation marks (parentheses, quotes, etc.). I inspected a bit and I noticed that OpenNMTTokenizer does not add the space marker in front of such symbols. How could we eliminate this inconsistency with the way SentencePiece handles punctuation marks?
You generated the vocabulary with `onmt-build-vocab` from OpenNMT-tf, right? When using SentencePiece with pre-tokenization, the output tokens are actually not meant to be compatible with the vocabulary generated by SentencePiece. We should fix the script to rebuild the vocabulary in this case.
Yes, the vocab is built with `onmt-build-vocab`. I just noticed the related PR in the OpenNMT-tf repo, thanks!
Some more feedback: I updated pyonmttok and OpenNMT-tf and tried to build a new vocab with sentencepiece and `case_markup`. The sp model and the vocab are built, but the user-defined symbols are not included in the vocabulary, even though they are recognized and mentioned by sentencepiece when training starts.
Also, now the only option that is accepted by `onmt-build-vocab` for building a sentencepiece model is `none`. This means we lose some of the goodies `aggressive` offers, but at least we should be able to use `case_markup`, right?
To summarize what was done in the latest update, there are now 2 modes when generating the SentencePiece vocabulary:
When no pretokenizer is set: the SentencePiece model is trained directly on the raw text (as `spm_train` would do) and its internal vocabulary is converted to the final vocabulary.
When a pretokenizer is set: the training data is pretokenized first, the SentencePiece model is trained on the resulting tokens, the data is then retokenized with this model, and the final vocabulary is built from the retokenized data.
> The sp model and the vocab are built, but the user-defined symbols are not included in the vocabulary, even though they are recognized and mentioned by sentencepiece when training starts.
Are the user-defined symbols in the training data? As said above, the training data is retokenized with SentencePiece so the symbols should appear in the tokenized data to be included in the vocabulary.
> Also, now the only option that is accepted by onmt-build-vocab for building a sentencepiece model is none. This means we lose some of the goodies aggressive offers but at least we should be able to use case_markup, right?
You should still be able to use another tokenization mode such as aggressive. Is there an error or bug?
I should get a better grasp of it, so I could use your help. First here is the command:
onmt-build-vocab --tokenizer_config ../../../Tokenization/lower_tokenization.yml --size 32000 --sentencepiece user_defined_symbols="⦅D01⦆,⦅D02⦆,⦅D03⦆,⦅D04⦆,⦅D05⦆,⦅mrk_case_modifier_C⦆,⦅mrk_case_modifier_L⦆,⦅mrk_case_modifier_U⦆,⦅mrk_case_modifier_M⦆,⦅mrk_case_modifier_N⦆,⦅mrk_begin_case_region_C⦆,⦅mrk_begin_case_region_L⦆,⦅mrk_begin_case_region_U⦆,⦅mrk_begin_case_region_M⦆,⦅mrk_begin_case_region_N⦆,⦅mrk_end_case_region_C⦆,⦅mrk_end_case_region_L⦆,⦅mrk_end_case_region_U⦆,⦅mrk_end_case_region_M⦆,⦅mrk_end_case_region_N⦆" character_coverage=1 input_sentence_size=10000000 num_threads=16 --size_multiple 8 --save_vocab vocab/base corpus.combined
Here is my lower_tokenization.yml:
type: OpenNMTTokenizer
params:
mode: none
case_markup: true
spacer_annotate: true
soft_case_region: true
preserve_placeholders: true
preserve_segmented_tokens: true
#segment_case: true
#segment_numbers: true
So, with this configuration, I think I'm using "Mode 1" and all options are ignored; the sp model and vocab are built, but the user-defined symbols are not added to the vocab, which confuses me. These symbols are not included in the corpus, but this is not a problem when using sentencepiece directly to create a model and vocab; it adds the user-defined symbols even when they are not present in the training corpus.
When I change `mode` to anything else in my config (`aggressive`, `conservative`, etc.), `onmt-build-vocab` refuses to run and throws this:
tokenizer = tokenizers.make_tokenizer(args.tokenizer_config)
File "/home/panos/venv36/lib/python3.6/site-packages/opennmt/tokenizers/tokenizer.py", line 322, in make_tokenizer
tokenizer = tokenizer_class(**tokenizer_params)
File "/home/panos/venv36/lib/python3.6/site-packages/opennmt/tokenizers/opennmt_tokenizer.py", line 23, in __init__
self._tokenizer = pyonmttok.Tokenizer(**kwargs)
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
1. pyonmttok._ext.Tokenizer(tokenizer: pyonmttok._ext.Tokenizer)
2. pyonmttok._ext.Tokenizer(mode: str, *, bpe_model_path: str = '', bpe_vocab_path: str = '', bpe_vocab_threshold: int = 50, bpe_dropout: float = 0, vocabulary_path: str = '', vocabulary_threshold: int = 0, sp_model_path: str = '', sp_nbest_size: int = 0, sp_alpha: float = 0.1, joiner: str = '■', joiner_annotate: bool = False, joiner_new: bool = False, spacer_annotate: bool = False, spacer_new: bool = False, case_feature: bool = False, case_markup: bool = False, soft_case_regions: bool = False, no_substitution: bool = False, preserve_placeholders: bool = False, preserve_segmented_tokens: bool = False, segment_case: bool = False, segment_numbers: bool = False, segment_alphabet_change: bool = False, support_prior_joiners: bool = False, segment_alphabet: object = None)
Invoked with: kwargs: mode='aggresive', case_markup=True, spacer_annotate=True, soft_case_region=True, preserve_placeholders=True, preserve_segmented_tokens=True
If I understand correctly, "Mode 2" requires using any mode except `none`. So, how would you advise training sentencepiece with `onmt-build-vocab` in order to get `case_markup` and all the other nice things from `aggressive`, if possible?
Thanks for your patience and your help.
> So, with this configuration, I think I'm using "Mode 1"
Sorry for the confusion, but when I said "When a pretokenizer is set", I meant whenever the option `--tokenizer_config` is set. It's easier to explain this way. So this configuration should trigger "Mode 2".
> When I change mode to anything else in my config (aggressive, conservative, etc.), onmt-build-vocab refuses to run and throws this:
There is a typo in your config: it should be `soft_case_regions`, not `soft_case_region`.
I'm following this thread with a lot of interest, many thanks @guillaumekln and @panosk.
So, if I understand well, it should be possible to pretokenise raw data using the aggressive mode, then create SP vocabs from that pretokenised data, then use the converted vocabs to segment text for training and inference with the OpenNMT tokeniser. I also understand this can be done manually or via the script.
However, I suppose that for the aggressive mode to work as expected when tokenising/detokenising, one should apply joiner annotation; otherwise, I see many possible ambiguity cases when detokenising. On the other hand, if a SP model is used, the tokens are generated with the spacer annotation by default, which is incompatible with the joiner annotation according to the doc.
Am I right? Or applying the aggressive mode does not need joiner annotation at all, and therefore, is fully compatible with using SP vocab models? Otherwise, could this be solved by applying different parameters when pretokenising for vocab creation and pretokenising for training/inference?
Hi @dmar1n ,
You can use the option `spacer_annotate`, in which case the annotation symbol is the same as the one used by sentencepiece.
@guillaumekln ,
Apologies for the naive typo; indeed, now I can use `aggressive` to build the sentencepiece model and vocab. However, the user-defined symbols are not included, as they have 0 frequency. Maybe a condition could be added when extracting the N most frequent tokens to keep entries with 0 frequency, as these tokens will only be meta-tokens. Then again, why is that extra step needed? I mean, doesn't the vocab created by sentencepiece already contain the most frequent tokens?
@dmar1n Joiner and spacer annotation is a postprocessing step, so it can work with any tokenization mode:
$ echo "Hello World!" | cli/tokenize --mode aggressive --joiner_annotate
Hello World ■!
$ echo "Hello World!" | cli/tokenize --mode aggressive --spacer_annotate
Hello ▁World !
$ echo "Hello World!" | cli/tokenize --mode none --sp_model_path ~/data/wmt_ende/wmtende.model
▁H ello ▁World !
$ echo "Hello World!" | cli/tokenize --mode none --sp_model_path ~/data/wmt_ende/wmtende.model --joiner_annotate
H ■ello World ■!
> On the other hand, if a SP model is used, the tokens are generated with the spacer annotation by default, which is incompatible with the joiner annotation according to the doc.
When you use SentencePiece via the OpenNMT Tokenizer, the spacers are removed internally and converted into metadata so that we can later decide if we want to inject joiners or spacers.
From the user perspective, using a pretokenization with SentencePiece should be the same as using a pretokenization with BPE.
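In Python terms, that means something like this (the model path is just an example), with either annotation style requested on the output tokens:

import pyonmttok

# Spacers produced by SentencePiece are converted internally, so the output
# can be annotated with joiners or spacers regardless of the subword model.
with_joiners = pyonmttok.Tokenizer("none", sp_model_path="wmtende.model", joiner_annotate=True)
with_spacers = pyonmttok.Tokenizer("none", sp_model_path="wmtende.model", spacer_annotate=True)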
@panosk
> Then again, why is that extra step needed? I mean, doesn't the vocab created by sentencepiece already contain the most frequent tokens?
This extra step is needed because the internal SentencePiece vocabulary is invalid when using a pretokenization. The basic example is when you want to use joiner annotation with SentencePiece: the SentencePiece internal vocabulary will contain spacers, but the external vocabulary should include joiners. This is why we need to get the vocabulary from the training data, and not from the SentencePiece internal representation.
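Conceptually, the rebuild is just a frequency count over the pretokenized training data; a simplified sketch (not the actual script code):

from collections import Counter

# Count token frequencies in the retokenized training file. These tokens may
# contain joiners, unlike the spacer-based entries stored in the SentencePiece
# internal vocabulary.
counter = Counter()
with open("train.txt.tok", encoding="utf-8") as f:
    for line in f:
        counter.update(line.split())

vocab = [token for token, _ in counter.most_common(32000)]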
But I'm not sure to understand the use case of user-defined symbols with 0 frequency. If they are not in the tokenized training data why should they appear in the vocabulary?
Thanks for the explanations @guillaumekln , I see.
> But I'm not sure to understand the use case of user-defined symbols with 0 frequency. If they are not in the tokenized training data why should they appear in the vocabulary?
I'm adding these symbols later for training the NMT model and for inference; at least that was the case when I was using sentencepiece directly, so I may have to adapt it now, no big deal. Anyway, I'll run a few iterations with the resulting sp model and vocab and see how it goes.
After a few tests, I can confirm that the user-defined symbols must be included in the vocab. Apart from any custom symbols (which can be included in the corpus used to train the sp model), the major problem is with the case markup symbols, which cannot be included in the training corpus beforehand but should be in the vocab anyway; otherwise casing doesn't work and there are countless `<unk>`s in their place.
Just to make sure I'm not doing anything wrong on my part: after creating the sp model and vocab, I used the same tokenization .yml config for the actual NMT training, with the extra option sp_model_path: /path_to_sp.model
To complete @panosk's comments, I have also run some tests with the same idea (applying the aggressive mode with case markup as pretokenization and SentencePiece as the vocab model).
I first tried manually by building the SentencePiece model on pretokenised text (which already included special symbols). This sort of worked (no errors), but I had the same problem as @panosk: the predictions had many unks, presumably related to the aggressive tokenisation.
With the script, I managed to reduce the amount of unks a lot, but there are still some in the evaluation predictions. This does not seem to impact the quality too much, but I cannot explain where these unks come from, since the validation data should be fully covered by the vocab.
On the other hand, I wonder if this is somehow an inevitable side effect of using pretokenised data with the aggressive mode, and then maybe replace_unk would help.
Concretely, I'm creating the vocabs with the script and the following --tokenizer_config:
type: OpenNMTTokenizer
params:
case_markup: true
joiner_annotate: false
mode: aggressive
segment_alphabet_change: true
segment_case: true
segment_numbers: true
spacer_annotate: true
support_prior_joiners: false
@panosk
> the major problem is with the case markup symbols, which cannot be included in the training corpus beforehand
When you tokenise the data for training, do you pretokenise using the OpenNMT tokeniser? This should add the case markup symbols to the training data. At least, this worked for me.
The case markup symbols should be included in the vocabulary. I just tried building the following dummy vocabulary to make sure it works:
$ echo "Hello world!" > tmp.txt
$ onmt-build-vocab --sentencepiece --size 12 --tokenizer_config '{"mode": "aggressive", "case_markup": true, "joiner_annotate": true}' --save_vocab output tmp.txt
$ cat output.vocab
<blank>
<s>
</s>
■l
■o
⦅mrk_case_modifier_C⦆
h
■e
w
■r
■d
■!
I confirm the case markup tokens are included in the vocabulary. These are the first lines of my target vocab:
<blank>
<s>
</s>
⦅mrk_case_modifier_C⦆
▁de
,
▁la
'
.
▁l
’
▁et
▁les
▁des
▁à
⦅mrk_begin_case_region_U⦆
⦅mrk_end_case_region_U⦆
And indeed, the predictions include the symbols.
Here is an example of a prediction with an unk:
⦅PH⦆ ⦅mrk_case_modifier_C⦆ <unk> ▁dossier :
⦅PH⦆ dossier ▁d ’ enquête :
⦅PH⦆ ⦅mrk_case_modifier_C⦆ inquiry ▁file :
In this sentence, ⦅PH⦆ is a custom symbol that is correctly predicted.
Well... I was using a lowercased version of my corpus with `onmt-build-vocab` (facepalm). This explains the absence of the case markup symbols from the vocab, but it still doesn't explain the plethora of `<unk>`s, as @dmar1n notices too. Once I realized I had been using the lowercased version of my corpus, I was almost certain a new test would show much more promising results, but I was surprised to see that the amount of `<unk>`s and the model performance were not affected by much (at least for the first few thousand steps). As a comparison, using a vanilla sentencepiece model and vocab (which is just converted to the proper format with `onmt-build-vocab`) gives 0 `<unk>`s even at the very first evaluation step. Now the amount of sentences containing at least 1 unk accounts for ~20% of the total number of predictions.
I also wonder if the increased amount of `<unk>`s is the price we have to pay for getting case handling.
But this is really strange, because `<unk>`s don't make sense in validation data, which is necessarily covered by the vocab. Moreover, the `<unk>`s seem to appear instead of normal words/tokens. With SentencePiece/BPE, the only `<unk>`s possible should be very rare characters not covered by the vocab.
I'm editing this post, as the example I gave was not exact. Here is a real case:
⦅PH⦆ ⦅mrk_case_modifier_C⦆ create ▁a ▁structured ▁interview
⦅PH⦆ ⦅mrk_case_modifier_C⦆ <unk> ▁un ▁entretien ▁structuré
⦅PH⦆ ⦅mrk_case_modifier_C⦆ créer ▁un ▁entretien ▁structuré
The source vocab has `create` and `▁create`. The target vocab has `créer` and `▁créer`.
When training the SentencePiece model, do you set the `input_sentence_size` option?
> With SentencePiece/BPE, the only <unk>s possible should be very rare characters not covered by the vocab.
That's only true for plain SentencePiece. When using a pretokenization with either SentencePiece or BPE, `<unk>`s are possible depending on the data distribution when generating the vocabulary.
I'm just not sure why the `<unk>` frequency is so high. In particular, I don't see how the example above can happen if all expected tokens are in the vocabulary.
I understand the initial goal of this issue is to train case insensitive SentencePiece models. We might need to think of a different approach that does not involve a full pretokenization.
> When training the SentencePiece model, do you set the input_sentence_size option?
Yes, but with a value in the order of millions. Otherwise, data is monolingual, of good quality and deduplicated.
> In particular, I don't see how the example above can happen if all expected tokens are in the vocabulary.
Actually, the example had the unk at 5k steps, but it corrected itself in a subsequent prediction. In general, I noticed that the number of unks is reduced as the training goes on. However, sentences with one or more unks still remain even after a significant number of steps (at 17k steps, I counted 209 sentences with an unk out of 2k validation lines, with BLEU scores already plateauing).
After a number of tests, I can confirm what @panosk pointed out: the issue seems to be linked to a non-alphabetic character preceding the token, such as apostrophes, parentheses, etc.
To give you another more representative example (at 17k steps):
▁children ▁( unaccompanied ▁or ▁with ▁their ▁families )
▁les ▁enfants ▁( <unk> ▁ou ▁avec ▁leur ▁famille )
▁les ▁enfants ▁( seuls ▁ou ▁accompagnés ▁de ▁leur ▁famille )
In this case, `▁unaccompanied` and `▁seuls` are in the vocabs, but not their variants without the spacer.
> Yes, but with a value in the order of millions.
Just to note that when using a pretokenization, `input_sentence_size` corresponds to a number of words, since the SentencePiece model is trained at the word level and not the sentence level.
> After a number of tests, I can confirm what @panosk pointed out: the issue seems to be linked to a non-alphabetic character preceding the token, such as apostrophes, parentheses, etc.
Maybe using `joiner_annotate` instead could improve the situation?
> Just to note that when using a pretokenization, input_sentence_size corresponds to a number of words, since the SentencePiece model is trained at the word level and not the sentence level.
You are right, but I was careful with that. So while for a normal sentencepiece training at the sentence level I set a limit of 10M sentences, with pretokenization I set a limit of 300M (tokens), which should be enough; at least that's a safe high limit for 64GB of RAM.
> Maybe using joiner_annotate instead could improve the situation?
That's a good idea, I'll try it asap!
Thanks a lot for the hints, @guillaumekln. I was indeed using a value of 10M. I will remove that argument and limit the initial corpus beforehand to 10M lines.
Regarding the joiner annotation, this was my initial idea when I first intervened in the thread. Unfortunately, when using joiner annotation, I got some incompatibility error with SentencePiece models. I will try again, though.
Here are some updates. I have tried with the joiner annotation. The vocabs are correctly created (there are the expected joiners and no spacers). But when I tokenise the training data, I get the following error:
ValueError: SentencePiece vocabulary restriction requires the tokenization to use "spacer_annotate" (same as spm_encode)
If I then change the config to have spacer annotation (using the vocabs correctly created with the joiners), I get extremely segmented data, which is normal given that the vocab does not have any spacer.
I see.
At this point, why not use BPE? Since managing case with SentencePiece currently requires a pretokenization (this could be improved in the future), it seems there is little benefit over BPE. From experience, the following BPE tokenization should work well in many cases:
pyonmttok.Tokenizer(
"aggressive",
bpe_model_path=...,
vocabulary_path=...,
joiner_annotate=True,
case_markup=True,
soft_case_regions=True,
preserve_placeholders=True,
preserve_segmented_tokens=True,
segment_case=True,
segment_numbers=True,
segment_alphabet_change=True,
)
Thanks for the config sample! I see there are options in that configuration that I was not specifying in my tests.
And for clarification, I have been using BPE as a subword model via SentencePiece all the time. I referred to SentencePiece just as the library used to subtokenise, which I configure via the option --sentencepiece model_type=bpe.
Update: I think I understand better now. So, the simplest way to proceed would be to create a BPE model, or a BPE-based tokeniser using the Python wrapper, with the required OpenNMT tokeniser options. This should indeed simplify the process a lot. I will try this approach and let you know. Many thanks again for your help!
Yes, I meant using the BPE implementation in the Tokenizer. The BPE training is not integrated in `onmt-build-vocab`, but it should be fairly easy to use the Python API to train the model, apply it on the training data, and then build the vocabulary.
@guillaumekln, I know this gets a bit off-topic, but could you please verify the steps below for using BPE? I've been using sentencepiece since forever and all my code is adapted to it, but I really need case handling so I'll test BPE extensively.
- Tokenize the training files with the Tokenizer, with case markup and the other options I want
- Run subword-nmt learn-joint-bpe-and-vocab with both training files
- Run onmt-build-vocab --from_vocab bpe-vocab.{src,tgt} --save_vocab onmt-vocab.{src,tgt} --size_multiple 8 (I keep the vocab sizes as they result from subword-nmt)
- Replace @@ with the joiner symbol in the vocabs

Thanks in advance!
I recommend training the BPE model with the Tokenizer directly. It will take care of many details and ensure consistency. Here's a basic workflow:
import pyonmttok
tokenizer = pyonmttok.Tokenizer(
"aggressive",
joiner_annotate=True,
case_markup=True,
soft_case_regions=True,
preserve_placeholders=True,
preserve_segmented_tokens=True,
segment_case=True,
segment_numbers=True,
segment_alphabet_change=True,
)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)
learner.ingest_file("train.txt")
tokenizer = learner.learn("bpe.model")
tokenizer.tokenize_file("train.txt", "train.txt.tok", num_threads=4)
Then build the vocabulary from train.txt.tok:
onmt-build-vocab --save_vocab bpe.vocab train.txt.tok
(Note: symbols=32000 is the number of BPE merge operations, not the vocabulary size. There will probably be more unique tokens in the tokenized data.)
Finally, you can either train directly on train.txt.tok without configuring the tokenization .yml files, or re-tokenize train.txt using the BPE model and vocabulary restriction (the vocabulary_path argument).
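For the second option, that would look roughly like this (reusing the file names from the sketch above; the vocabulary file name may differ depending on how onmt-build-vocab names its output):

import pyonmttok

# Re-tokenize with the trained BPE model, restricting merges so that produced
# tokens stay within the generated vocabulary.
tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    bpe_model_path="bpe.model",
    vocabulary_path="bpe.vocab",
    joiner_annotate=True,
    case_markup=True,
    soft_case_regions=True,
    preserve_placeholders=True,
    preserve_segmented_tokens=True,
    segment_case=True,
    segment_numbers=True,
    segment_alphabet_change=True,
)
tokenizer.tokenize_file("train.txt", "train.txt.tok", num_threads=4)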
Let's try not to diverge too much from the initial issue. For further discussion about BPE, please consider opening a topic on the forum.
Thanks a lot!
> Let's try not to diverge too much from the initial issue. For further discussion about BPE, please consider opening a topic on the forum.
Absolutely!
I followed the suggested approach to build the vocabs and tokenise the training data. Up to here, everything works like a charm. After 15k training steps, though, there are still many `<unk>`s, but now with a different pattern: the `<unk>`s appear between digits or special symbols. After some analysis, it seems that these `<unk>`s correspond to joiners that end up between those characters. Here is an example:
(■ 2 ■ 0 ■ 1 ■ 9 ■)
(■ 2 <unk> 0 <unk> 1 <unk> 9 ■)
As you can see, each parenthesis has its joiner attached, while the numbers have spaces around them; unfortunately, everything indicates that these orphaned joiners are systematically replaced with `<unk>`s in the predictions. Interestingly enough, this does not happen with alphabetic or punctuation tokens.
I replicated the proposed settings/workflow line by line, but maybe I missed an important option here? Otherwise, it shouldn't be difficult to fix this issue in a postprocessing step, but I guess it would be better to find the root cause first. I will look at it and let you know if I find anything relevant.
Hi @dmar1n ,
If you followed the steps for using BPE directly in the tokenizer with no sentencepiece involvement, I can confirm that it works like a charm and I get 0 `<unk>`s right from the start, so maybe you missed something.
As @guillaumekln noted, we are getting off track from the initial issue, so feel free to post your last comment in the forum and we can continue there.
Thanks, @panosk, it's good to know that it works for you. I confirm I followed the exact workflow and options suggested. Also note that the issue remains the same for me; that is, not being able to use case markup in any configuration with subword tokenisation. Anyway, I will give it another try and post the issue in the forum, if still unresolved. Thanks both for your help!
Just a quick update. The suggested solution did work eventually. I think it was a problem of the versions installed. With the latest versions, it works great. Thanks again!
To get back to the initial issue and request, `case_markup` with "true" SentencePiece would definitely be useful, but I still did not find a good solution that ticks all the boxes. So I'm not sure it is possible to effectively implement this outside of SentencePiece. If you have any ideas, please let me know.
When using `case_markup` in `space`/`none` mode, unexpected behavior happens: as you can see in the example below, `.detokenize` cannot rebuild the original text. The same behavior exists for `space`. Modes `conservative` and `aggressive` do not suffer from this issue, but their result is not consistent with the same tokenization without `case_markup`, as they split the text to insert the markup placeholders.
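A minimal sketch of the kind of round trip in question (the input text is just an example; exact tokens depend on the Tokenizer version):

import pyonmttok

# In "space" mode tokens cannot be split, so a mixed-case token like "WiFi"
# cannot be separated into cased segments; the reported problem is that the
# round trip below does not give back the original text.
tokenizer = pyonmttok.Tokenizer("space", case_markup=True)
text = "The WiFi router"
tokens, _ = tokenizer.tokenize(text)
print(tokens)
print(tokenizer.detokenize(tokens))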