facebookresearch / fairseq2

FAIR Sequence Modeling Toolkit 2
https://facebookresearch.github.io/fairseq2/
MIT License

Add `target_twoway` mode in the NLLB tokenizer #602

Open avidale opened 1 week ago

avidale commented 1 week ago

What does this PR do? Please describe:

This PR introduces a `target_twoway` mode in the NLLB tokenizer. It represents the target text as `</s> __lang_code__ The text. </s>`.

This mode is handy for training translation models (like NLLB) or text decoding models (like SONAR decoders). With this mode, all but the last token (e.g. `</s> __lang_code__ The text.`) can be used as the teacher-forced input to the decoder during its training, and all but the first token (e.g. `__lang_code__ The text. </s>`) can be used as the next-token targets for the same decoder, which is how NLLB (and also SONAR) models are supposed to be trained. The split is sketched below.
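To make the input/target split concrete, here is a minimal sketch with made-up token ids; the `EOS`/`LANG` constants, the ids, and the example sentence are illustrative and not real NLLB vocabulary entries:

```python
import torch

# Illustrative ids standing in for "</s> __eng_Latn__ The text . </s>";
# real ids would come from the NLLB vocabulary.
EOS, LANG = 2, 256047
seq = torch.tensor([EOS, LANG, 504, 1176, 5, EOS])

decoder_input = seq[:-1]  # </s> __eng_Latn__ The text .   (teacher-forced input)
targets = seq[1:]         # __eng_Latn__ The text . </s>   (next-token targets)

# During training the decoder consumes `decoder_input`, and the loss is
# computed between its per-position logits and `targets`, e.g.:
#   loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
```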

Without this change, a similar result could be achieved only in less straightforward ways:

I believe that producing a single token sequence and selecting the right tokens from it as decoder inputs or decoder targets is the simplest strategy, so this new tokenization mode is introduced to enable it. A sketch of the intended usage follows.
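For illustration, here is a rough sketch of how the mode could be invoked; the loader import path and the asset card name are assumptions and may differ from the actual fairseq2 API:

```python
from fairseq2.models.nllb import load_nllb_tokenizer  # assumed import path

# "nllb-200_dense_distill_600m" is an assumed asset card name.
tokenizer = load_nllb_tokenizer("nllb-200_dense_distill_600m")

# The new mode emits </s> __lang__ ... </s> in a single pass.
encoder = tokenizer.create_encoder(
    task="translation", lang="eng_Latn", mode="target_twoway"
)

seq = encoder("The text.")
decoder_input, targets = seq[:-1], seq[1:]
```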

Does your PR introduce any breaking changes? If yes, please list them: None

Check list: