## What does this PR do?

This PR adds the option to NOT reuse the last target token of the previous sample as the first token of the next sample during training. This is needed for instruction tuning, where training samples must not overlap (in contrast to pretraining).
In addition, we provide example configurations for preparing the data and running the training.
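To illustrate the difference between the two modes, here is a minimal sketch of the sampling behavior. The helper `make_samples` and its `reuse_last_target` flag are hypothetical names for illustration only, not the repo's actual implementation:

```python
def make_samples(token_ids, block_size, reuse_last_target=True):
    """Split a token stream into (input, target) training samples.

    With reuse_last_target=True (pretraining-style packing), consecutive
    samples overlap by one token: the last target of sample i is reused
    as the first input of sample i+1. With reuse_last_target=False
    (instruction tuning), samples are fully disjoint.
    """
    # Overlapping mode advances by block_size, so adjacent chunks share
    # one token; disjoint mode advances by block_size + 1.
    step = block_size if reuse_last_target else block_size + 1
    samples = []
    for start in range(0, len(token_ids) - block_size, step):
        chunk = token_ids[start:start + block_size + 1]
        if len(chunk) < block_size + 1:
            break
        # Targets are the inputs shifted by one position.
        samples.append((chunk[:-1], chunk[1:]))
    return samples


tokens = list(range(10))
overlapping = make_samples(tokens, block_size=3, reuse_last_target=True)
disjoint = make_samples(tokens, block_size=3, reuse_last_target=False)
```

In the overlapping case, the first input token of each sample equals the last target token of the previous one; with the new option enabled, the boundary token is not shared between samples.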
## Checklist before submitting final PR
- [x] My PR is minimal and addresses one issue in isolation
- [x] I have merged the latest version of the target branch into this feature branch
- [x] I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
- [x] I have run a sample config for model training
- [x] I have checked that all tests run through (`python tests/tests.py`)