What does this PR do?
This PR adds support for instruction tuning by introducing a new entry point data apply_chat_template --config_file_path config_files/training/config_lorem_ipsum_sft.yaml, which takes an instruction dataset and converts the structured conversations into a single prompt by applying the chat template given as a Jinja2 template string within the config. Here, we also include indicator tokens to mark which utterances stem from the system.
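To illustrate the idea, here is a minimal sketch of rendering a conversation with a Jinja2 chat template. The template string, the role names, and the `^ass_begin`/`^ass_end` indicator tokens are assumptions for illustration, not the actual template shipped in the config:

```python
from jinja2 import Template

# Hypothetical chat template string, as it might appear in the YAML config.
# The ^ass_begin / ^ass_end indicator tokens are made up for this sketch.
chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'assistant' %}"
    "^ass_begin {{ message['content'] }} ^ass_end\n"
    "{% else %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endif %}"
    "{% endfor %}"
)

# A structured conversation as found in an instruction dataset
conversation = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]

# Render the conversation into a single prompt string
prompt = Template(chat_template).render(messages=conversation)
print(prompt)
```

The rendered prompt contains the indicator tokens around the assistant turn, which is what later allows the loss masking step to locate the assistant's answer.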
In the modalities training entry point you can now wrap the collate function in a "LossMaskingCollateFn", which first executes the wrapped collate function and then applies loss masking on each target as specified in the config. This allows including only those tokens in the loss that are part of the assistant's answer, so that the model learns to act as a helpful assistant.
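The wrapping can be sketched as follows. The class name matches the PR, but the batch layout, the indicator token ids, and the helper function are illustrative assumptions, not the actual modalities implementation; -100 is the index that torch.nn.CrossEntropyLoss ignores by default:

```python
IGNORE_INDEX = -100            # conventionally skipped by the loss
B_ASSISTANT, E_ASSISTANT = 90, 91  # hypothetical indicator token ids

def mask_targets(target_ids):
    """Return a copy where every token outside an assistant span is masked."""
    masked = []
    inside = False
    for tok in target_ids:
        if tok == B_ASSISTANT:
            inside = True
            masked.append(IGNORE_INDEX)   # the indicator itself is not learned
        elif tok == E_ASSISTANT:
            inside = False
            masked.append(IGNORE_INDEX)
        else:
            masked.append(tok if inside else IGNORE_INDEX)
    return masked

class LossMaskingCollateFn:
    """Executes the wrapped collate function, then masks non-assistant targets."""

    def __init__(self, wrapped_collate_fn):
        self.wrapped_collate_fn = wrapped_collate_fn

    def __call__(self, samples):
        batch = self.wrapped_collate_fn(samples)  # assumed to yield "target_ids"
        batch["target_ids"] = [mask_targets(t) for t in batch["target_ids"]]
        return batch
```

Wrapping rather than replacing the collate function keeps the existing collation logic untouched; only the targets are post-processed.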
Modify the PackedMemMapDatasetContinuous so that re-using the last target token can be disabled, as this re-use is not wanted in instruction tuning, where we apply truncation and packing.
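The effect of the flag can be sketched on a flat token stream. This is a simplified stand-in for the dataset's slicing logic, with an assumed parameter name taken from the PR; with re-use enabled, consecutive blocks overlap by one token so the last target of one sample reappears as the first input of the next:

```python
def continuous_blocks(token_stream, block_size, reuse_last_target=True):
    """Slice a flat token stream into fixed-size sample blocks.

    Illustrative sketch, not the actual PackedMemMapDatasetContinuous code.
    With reuse_last_target=True, consecutive blocks share one boundary token.
    """
    step = block_size - 1 if reuse_last_target else block_size
    return [
        token_stream[start : start + block_size]
        for start in range(0, len(token_stream) - block_size + 1, step)
    ]

stream = list(range(10))
print(continuous_blocks(stream, 4, reuse_last_target=True))
print(continuous_blocks(stream, 4, reuse_last_target=False))
```

With overlap disabled, each token belongs to exactly one packed sample, which is the behavior wanted for instruction tuning.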
General Changes
New entry point data apply_chat_template --config_file_path config_files/training/config_lorem_ipsum_sft.yaml to convert structured JSONL into JSONL with the new attribute "chat", i.e. the prompt where the chat template was applied
A wrapper for collate functions to include only the tokens that appear between indicator tokens in the loss
A new parameter for the PackedMemMapDatasetContinuous to optionally disable re-using the last target token
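As an illustration of the JSONL conversion, a structured input line and the converted line with the new "chat" attribute might look like the following. The attribute names other than "chat" and the trivial stand-in template are assumptions, not the exact modalities schema:

```python
import json

# Hypothetical structured input line of an instruction dataset
raw_line = (
    '{"conversations": ['
    '{"role": "user", "content": "Hi"}, '
    '{"role": "assistant", "content": "Hello!"}]}'
)

record = json.loads(raw_line)
# Trivial stand-in for the configured chat template
record["chat"] = "".join(
    f"{m['role']}: {m['content']}\n" for m in record["conversations"]
)
print(json.dumps(record))
```

Each output line keeps the original structure and simply gains the rendered "chat" prompt alongside it.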
Breaking Changes
None, as the default value for PackedMemMapDatasetContinuous.reuse_last_target is True
Checklist before submitting final PR
[X] My PR is minimal and addresses one issue in isolation
[X] I have merged the latest version of the target branch into this feature branch
[X] I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
[X] I have run a sample config for model training
[X] I have checked that all tests run through (python tests/tests.py)