What does this PR do?
This PR adds support for instruction tuning by introducing a new entry point data apply_chat_template --config_file_path config_files/training/config_lorem_ipsum_sft.yaml, which takes an instruction dataset and converts the structured conversations into a single prompt by applying the chat template given as a Jinja2 template string within the config. Here, we also include indicator tokens to mark which utterances stem from the system.
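To illustrate the idea, here is a minimal sketch of rendering a conversation with a Jinja2 chat template. The template string, the role names, and the `^ass_begin`/`^ass_end` indicator tokens are assumptions for illustration, not the actual template shipped in the config:

```python
from jinja2 import Template

# Hypothetical chat template string, as it might appear in the YAML config.
# The ^ass_begin / ^ass_end indicator tokens are made up for this sketch.
chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'assistant' %}"
    "^ass_begin {{ message['content'] }} ^ass_end\n"
    "{% else %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endif %}"
    "{% endfor %}"
)

# A structured conversation as found in an instruction dataset
conversation = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]

# Render the conversation into a single prompt string
prompt = Template(chat_template).render(messages=conversation)
print(prompt)
```

The rendered prompt contains the indicator tokens around the assistant turn, which is what later allows the loss masking step to locate the assistant's answer.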
In the modalities training entry point you can now wrap the collate function in a "LossMaskingCollateFn", which first executes the wrapped collate function and then applies loss masking on each target as specified in the config. This allows including only those tokens in the loss that are part of the assistant's answer, so that the model learns to act as a helpful assistant.
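The wrapping can be sketched as follows. The class name matches the PR, but the batch layout, the indicator token ids, and the helper function are illustrative assumptions, not the actual modalities implementation; -100 is the index that torch.nn.CrossEntropyLoss ignores by default:

```python
IGNORE_INDEX = -100            # conventionally skipped by the loss
B_ASSISTANT, E_ASSISTANT = 90, 91  # hypothetical indicator token ids

def mask_targets(target_ids):
    """Return a copy where every token outside an assistant span is masked."""
    masked = []
    inside = False
    for tok in target_ids:
        if tok == B_ASSISTANT:
            inside = True
            masked.append(IGNORE_INDEX)   # the indicator itself is not learned
        elif tok == E_ASSISTANT:
            inside = False
            masked.append(IGNORE_INDEX)
        else:
            masked.append(tok if inside else IGNORE_INDEX)
    return masked

class LossMaskingCollateFn:
    """Executes the wrapped collate function, then masks non-assistant targets."""

    def __init__(self, wrapped_collate_fn):
        self.wrapped_collate_fn = wrapped_collate_fn

    def __call__(self, samples):
        batch = self.wrapped_collate_fn(samples)  # assumed to yield "target_ids"
        batch["target_ids"] = [mask_targets(t) for t in batch["target_ids"]]
        return batch
```

Wrapping rather than replacing the collate function keeps the existing collation logic untouched; only the targets are post-processed.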
Modify the PackedMemMapDatasetContinuous so that re-using the last target token can be disabled, as this re-use is not wanted in instruction tuning, where we apply truncation and packing.
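The effect of the flag can be sketched on a flat token stream. This is a simplified stand-in for the dataset's slicing logic, with an assumed parameter name taken from the PR; with re-use enabled, consecutive blocks overlap by one token so the last target of one sample reappears as the first input of the next:

```python
def continuous_blocks(token_stream, block_size, reuse_last_target=True):
    """Slice a flat token stream into fixed-size sample blocks.

    Illustrative sketch, not the actual PackedMemMapDatasetContinuous code.
    With reuse_last_target=True, consecutive blocks share one boundary token.
    """
    step = block_size - 1 if reuse_last_target else block_size
    return [
        token_stream[start : start + block_size]
        for start in range(0, len(token_stream) - block_size + 1, step)
    ]

stream = list(range(10))
print(continuous_blocks(stream, 4, reuse_last_target=True))
print(continuous_blocks(stream, 4, reuse_last_target=False))
```

With overlap disabled, each token belongs to exactly one packed sample, which is the behavior wanted for instruction tuning.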
General Changes
New entry point data apply_chat_template --config_file_path config_files/training/config_lorem_ipsum_sft.yaml to convert structured JSONL into JSONL with the new attribute "chat", i.e. the prompt where the chat template was applied
A wrapper for collate functions to include only the tokens that appear between indicator tokens in the loss
A new parameter for the PackedMemMapDatasetContinuous to optionally disable re-using the last target token
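As an illustration of the JSONL conversion, a structured input line and the converted line with the new "chat" attribute might look like the following. The attribute names other than "chat" and the trivial stand-in template are assumptions, not the exact modalities schema:

```python
import json

# Hypothetical structured input line of an instruction dataset
raw_line = (
    '{"conversations": ['
    '{"role": "user", "content": "Hi"}, '
    '{"role": "assistant", "content": "Hello!"}]}'
)

record = json.loads(raw_line)
# Trivial stand-in for the configured chat template
record["chat"] = "".join(
    f"{m['role']}: {m['content']}\n" for m in record["conversations"]
)
print(json.dumps(record))
```

Each output line keeps the original structure and simply gains the rendered "chat" prompt alongside it.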
Breaking Changes
None, as the default value for PackedMemMapDatasetContinuous.reuse_last_target is True
Checklist before submitting final PR
[X] My PR is minimal and addresses one issue in isolation
[X] I have merged the latest version of the target branch into this feature branch
[X] I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
[X] I have run a sample config for model training
[X] I have checked that all tests run through (python tests/tests.py)