OpenNMT / Tokenizer

Fast and customizable text tokenization library with BPE and SentencePiece support
https://opennmt.net/
MIT License

case_markup changes critical for trained models #274

Closed: anderleich closed this issue 2 years ago

anderleich commented 2 years ago

Hi,

I've recently run into some issues when translating text tokenized with case_markup enabled. After some manual inspection of the results, I found differences in the case_markup output compared to text tokenized in the past.

I'm concerned about these changes in behaviour, as they completely change the input sequences and, as a result, the translation results too. I guess we should save the version of the tokenizer used when training a model, just to be sure input is tokenized the same way in the future. However, we would then not benefit from other code improvements made since.

I would appreciate any suggestion regarding this issue. Thanks

guillaumekln commented 2 years ago

Hi,

Can you provide an example? Please also specify all tokenization options that are used and the versions you are comparing.

anderleich commented 2 years ago

These are some examples returned by diff:

Los tornillos guias irán en las posiciones 5H, 11H y 5C.
< ⦅mrk_case_modifier_C⦆ los torn■ illos gu■ ias irán en las posiciones 5■ ⦅mrk_case_modifier_C⦆ h ■, 11■ ⦅mrk_case_modifier_C⦆ h y 5■ ⦅mrk_case_modifier_C⦆ c ■.
---
> ⦅mrk_case_modifier_C⦆ los torn■ illos gu■ ias irán en las posiciones ⦅mrk_case_modifier_C⦆ 5■ h ■, ⦅mrk_case_modifier_C⦆ 11■ h y ⦅mrk_case_modifier_C⦆ 5■ c ■.

- Valores límite para óxidos de nitrógeno (R.D. 1073 / 2002)
< - ⦅mrk_case_modifier_C⦆ valores límite para óx■ idos de nitrógeno (■ ⦅mrk_case_modifier_C⦆ r■ ⦅mrk_case_modifier_C⦆ .■ ⦅mrk_case_modifier_C⦆ d ■. 10■ 73 / 2002 ■)
---
> - ⦅mrk_case_modifier_C⦆ valores límite para óx■ idos de nitrógeno (■ ⦅mrk_case_modifier_C⦆ r■ .■ ⦅mrk_case_modifier_C⦆ d ■. 10■ 73 / 2002 ■)

Artículo 52.Supervisión de riesgos.
< ⦅mrk_case_modifier_C⦆ artículo 5■ 2■ .■ ⦅mrk_case_modifier_C⦆ supervisión de riesgos ■.
---
> ⦅mrk_case_modifier_C⦆ artículo ⦅mrk_case_modifier_C⦆ 5■ 2.■ supervisión de riesgos ■.

Tokenization options as specified in the config file:

tgt_subword_type: bpe
src_subword_model: /path/codes.es
src_onmttok_kwargs: {"mode": "conservative", "joiner": "■", "joiner_annotate": true, "case_markup": true}

Current version: pyonmttok 1.26.4
Previous version: pyonmttok 1.25.0

I'm afraid I can't tell the exact version I used in the past, as it might have been updated at some point. It could be different from 1.25.0.
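
For reference, I believe these options correspond to roughly the following pyonmttok call (a sketch; the codes path is the same placeholder as in the config):

import pyonmttok

# Rough equivalent of the config options above.
tokenizer = pyonmttok.Tokenizer(
    "conservative",
    bpe_model_path="/path/codes.es",  # placeholder path, as in the config
    joiner="■",
    joiner_annotate=True,
    case_markup=True,
)

tokens, _ = tokenizer.tokenize("Los tornillos guias irán en las posiciones 5H, 11H y 5C.")
print(" ".join(tokens))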

guillaumekln commented 2 years ago

I confirm this changed in version 1.27.0 following this commit: https://github.com/OpenNMT/Tokenizer/commit/504437f66dd9d7c807245dc78095582b54d0d242. That commit fixed the option segment_case (enabled automatically by case_markup), which did not work properly in some situations.

Before the fix, the uppercase letters from your example were not segmented (incorrect):

$ echo "Los tornillos guias irán en las posiciones 5H, 11H y 5C." | cli/tokenize --joiner_annotate --segment_case
Los tornillos guias irán en las posiciones 5H ■, 11H y 5C ■.

After the fix:

$ echo "Los tornillos guias irán en las posiciones 5H, 11H y 5C." | cli/tokenize --joiner_annotate --segment_case
Los tornillos guias irán en las posiciones 5■ H ■, 11■ H y 5■ C ■.
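
The same check can be run from Python (a quick sketch; the joiner is set explicitly to match the output above):

import pyonmttok

# After the fix, segment_case splits off the uppercase letters.
tokenizer = pyonmttok.Tokenizer(
    "conservative", joiner="■", joiner_annotate=True, segment_case=True
)
tokens, _ = tokenizer.tokenize("Los tornillos guias irán en las posiciones 5H, 11H y 5C.")
print(" ".join(tokens))
# Los tornillos guias irán en las posiciones 5■ H ■, 11■ H y 5■ C ■.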

In general we make sure the tokenization is unchanged in new versions, but bugs can happen and we have to fix them.
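
If you want to catch such changes when upgrading, one option is a small regression check that re-tokenizes a sample corpus and compares it with a reference produced by the version you trained with (a sketch; the file names are placeholders):

import pyonmttok

tokenizer = pyonmttok.Tokenizer(
    "conservative", joiner="■", joiner_annotate=True, case_markup=True
)

# sample.src holds raw sentences; sample.ref.tok their tokenization
# as produced by the old version.
with open("sample.src") as src, open("sample.ref.tok") as ref:
    for i, (line, expected) in enumerate(zip(src, ref), start=1):
        tokens, _ = tokenizer.tokenize(line.rstrip("\n"))
        if " ".join(tokens) != expected.rstrip("\n"):
            print(f"line {i} differs: {' '.join(tokens)}")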

> I'm concerned about these changes in behaviour, as they completely change the input sequences and, as a result, the translation results too.

How is the quality of these new translations? Since the new tokenization does not introduce new patterns, I guess the model can still work reasonably well on the new inputs.

anderleich commented 2 years ago

I've done some more tests and, as you said, it seems to be fixed. With the latest version I obtain the same tokenization results as I did with 1.18.5. (In my previous comment I wrote 1.25.0, which was wrong.)

Version 1.18.5: ⦅mrk_case_modifier_C⦆ los torn■ illos gu■ ias irán en las posiciones 5■ ⦅mrk_case_modifier_C⦆ h ■, 11■ ⦅mrk_case_modifier_C⦆ h y 5■ ⦅mrk_case_modifier_C⦆ c ■.
Version 1.26.4: ⦅mrk_case_modifier_C⦆ los torn■ illos gu■ ias irán en las posiciones ⦅mrk_case_modifier_C⦆ 5■ h ■, ⦅mrk_case_modifier_C⦆ 11■ h y ⦅mrk_case_modifier_C⦆ 5■ c ■.
Version 1.30.0 (newest): ⦅mrk_case_modifier_C⦆ los torn■ illos gu■ ias irán en las posiciones 5■ ⦅mrk_case_modifier_C⦆ h ■, 11■ ⦅mrk_case_modifier_C⦆ h y 5■ ⦅mrk_case_modifier_C⦆ c ■.

Version 1.18.5: ⦅mrk_case_modifier_C⦆ artículo 5■ 2■ .■ ⦅mrk_case_modifier_C⦆ supervisión de riesgos ■.
Version 1.26.4: ⦅mrk_case_modifier_C⦆ artículo ⦅mrk_case_modifier_C⦆ 5■ 2.■ supervisión de riesgos ■.
Version 1.30.0 (newest): ⦅mrk_case_modifier_C⦆ artículo 5■ 2■ .■ ⦅mrk_case_modifier_C⦆ supervisión de riesgos ■.

Thanks for your time!

anderleich commented 2 years ago

Regarding the impact of the different tokenizations on the output: I agree the model can still work reasonably well on the new inputs, as only small changes are expected. However, those subtle changes do have an impact on BLEU scores, which becomes critical when comparing models trained with different tokenizations (for example, when comparing against an older model).

For example:

BLEU with version 1.25.0: 23.01
BLEU with version 1.30.0: 22.89

It is not a huge change, but it can be significant when the improvements expected from a new experiment are small.

guillaumekln commented 2 years ago

I don't think we can draw conclusions from this single data point.

This tokenization change is small and only affects specific patterns, so it will not have a noticeable impact on the final quality.

anderleich commented 2 years ago

Hi @guillaumekln,

I've run some more tests over 7 different test sets that I usually use to evaluate model performance. These are the results:

BLEU with version 1.25.0: 23.01 | 14.33 | 22.23 | 37.90 | 28.71 | 35.55 | 51.39
BLEU with version 1.30.0: 22.89 | 14.21 | 22.09 | 37.77 | 28.50 | 35.40 | 50.98

It consistently performs worse on all the test sets. This is expected, though, as some patterns have changed since the model was trained. I agree it will not have a noticeable impact on the final quality of the translations. However, I think it is worth keeping in mind, especially when comparing different experiments.
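
For what it's worth, per-test-set scores like these can be computed with an external tool such as sacrebleu (a sketch, not necessarily the exact setup used here; file names are placeholders):

from sacrebleu.metrics import BLEU  # external package, assumed installed

bleu = BLEU()
for name in ["test1", "test2", "test3"]:  # placeholder test set names
    refs = [line.rstrip("\n") for line in open(f"{name}.ref")]
    hyps = [line.rstrip("\n") for line in open(f"{name}.hyp")]
    print(name, round(bleu.corpus_score(hyps, [refs]).score, 2))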

guillaumekln commented 2 years ago

Just to be clear, did you retrain a model with 1.30.0 and compare it against a model trained with 1.25.0? Or did you just update the tokenizer at inference time?

In the first case, you should run multiple trainings to get a better comparison. Due to the random nature of training, it can happen that a model ends up slightly behind another one for the same number of training steps.

In the second case, the BLEU changes are mostly expected.

anderleich commented 2 years ago

I just updated the tokenizer at inference time. My concern is about the inconsistency between training and inference.

guillaumekln commented 2 years ago

Ok, that makes sense. In https://github.com/OpenNMT/Tokenizer/issues/274#issuecomment-1020079756 you mentioned "comparing different models trained with different tokenizations", which got me confused.

anderleich commented 2 years ago

Yes, that would be the case where you train a new model, say with better data, and want to compare the baseline (trained using a previous Tokenizer version) against the newly trained model (trained with the newest version). In such a case, these differences in the output could play a major role in the comparison if the improvements expected from adding the new data were already small.

I don't know if I made myself clear enough :)

guillaumekln commented 2 years ago

When training a new model from scratch using the latest Tokenizer version, we don't expect the difference described in https://github.com/OpenNMT/Tokenizer/issues/274#issuecomment-1019845168 to have a visible impact on the final score. Here the issue is mostly about running inference with a model trained using an older Tokenizer version.

In general we try to keep the tokenization unchanged as much as possible, but fixing bugs or even updating the ICU library can introduce small differences. To ensure the tokenization is fully identical, you may need to freeze the Tokenizer version in your experiment.
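
For example, one way to make this reproducible is to record the installed version next to the model artifacts (a sketch; the metadata file name is hypothetical):

import json
from importlib.metadata import version  # Python 3.8+

# Save the exact pyonmttok version used at training time so the same
# one can be installed at inference time.
with open("tokenizer_meta.json", "w") as f:
    json.dump({"pyonmttok": version("pyonmttok")}, f)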