What does this PR do? Please describe:
This PR introduces a `target_twoway` mode in the NLLB tokenizer. It represents the target text as `</s>__lang_code__ The text.</s>`.
This mode is handy for training translation models (like NLLB) or text decoding models (like SONAR decoders). With this mode, the all-but-the-last tokens (e.g. `</s>__lang_code__ The text.`) can be used as teacher-forced inputs to the decoder during its training, and the all-but-the-first tokens (e.g. `__lang_code__ The text.</s>`) can be used as the next-token targets for the same decoder. This is how NLLB (and also SONAR) models are supposed to be trained.
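To make the intended usage concrete, below is a minimal PyTorch sketch of the slicing; the token ids, the padding value, and the batch itself are purely illustrative, and the batch is assumed to have been produced by the new `target_twoway` mode:

```python
import torch

# Hypothetical padded batch produced by the target_twoway mode; the ids are
# made up for illustration (here 2 stands for </s> and 0 for padding).
target_tokens = torch.tensor(
    [[2, 256047, 1012, 543, 2, 0],
     [2, 256047, 88, 2, 0, 0]]
)

# teacher-forced decoder inputs: all but the last token
# (</s> __lang_code__ The text.)
decoder_input_ids = target_tokens[:, :-1]

# next-token prediction targets: all but the first token
# (__lang_code__ The text. </s>)
labels = target_tokens[:, 1:]

# the loss is then cross-entropy between decoder(decoder_input_ids) and labels,
# with padding positions ignored
```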
Without this change, a similar result could be achieved in less straightforward ways:
either by tokenizing the decoder inputs in the target mode and the decoder outputs in the source mode, passing both sequences to the training step, and truncating/collating/padding both of them identically - which seems slightly brittle;
or by tokenizing the decoder inputs in the target mode and then turning them into decoder targets by removing the first `</s>` token and appending one after the text - which is not an easy operation on a padded batch of different-length sequences, and may add an `</s>` token incorrectly to a sequence that has been truncated;
or by tokenizing the decoder targets in the source mode and then turning them into decoder inputs by removing the last token (either `</s>` or a pad token) and adding an `</s>` token at the beginning - which is actually fine (e.g. the `transformers` implementation of NLLB does it with its `shift_tokens_right` function; a rough sketch of this approach is given after this list), but it requires either concatenating tensors or copying and filling them in place, which is slightly less efficient than just selecting all-but-the-first or all-but-the-last tokens.
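For comparison, here is a rough sketch of the shifting approach from the last bullet; it is not the actual `transformers` code, only an illustration of the extra tensor allocation that the slicing approach avoids:

```python
import torch

def shift_tokens_right_sketch(labels: torch.Tensor, eos_id: int) -> torch.Tensor:
    """Turn next-token targets into teacher-forced decoder inputs
    by dropping the last column and prepending an </s> column."""
    eos_column = torch.full(
        (labels.size(0), 1), eos_id, dtype=labels.dtype, device=labels.device
    )
    # this allocates and concatenates a new tensor (or, equivalently, copies
    # and fills one in place), unlike the simple all-but-the-first /
    # all-but-the-last slicing enabled by target_twoway
    return torch.cat([eos_column, labels[:, :-1]], dim=1)
```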
I believe that having a single token sequence and selecting the right tokens from it to serve as decoder inputs or decoder targets is the simplest strategy, so this new tokenization mode is introduced to enable it.
Does your PR introduce any breaking changes? If yes, please list them:
None
Check list:
[ ] Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)