axolotl-ai-cloud / axolotl


Add flexible configuration options for `chat_template` dataset training #1756

Closed · Tostino closed this 1 month ago

Tostino commented 1 month ago

Motivation and Context

This change is required to provide more granular control over the training process, particularly for fine-tuning models on specific roles and message components. It solves the problem of overly broad training that may not focus on the most relevant parts of the dataset.

These enhancements will enable researchers and developers to create more targeted and efficient training workflows, potentially leading to better model performance on specific tasks or domains.
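
For context, here is a minimal config sketch of how these options might be wired into a `chat_template` dataset entry. Only `message_field_training` and `message_field_training_detail` are named later in this thread; the other field names and values are assumptions for illustration, not confirmed by the PR:

```yaml
datasets:
  - path: ./data/conversations.jsonl   # hypothetical dataset path
    type: chat_template
    # Assumed message-layout keys, shown for illustration only:
    field_messages: messages
    message_field_role: role
    message_field_content: content
    # Per-message training overrides discussed below in this thread:
    message_field_training: training
    message_field_training_detail: train_detail
```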

How has this been tested?

The changes have been tested locally with an enhanced test suite that covers all new functionality, including:

  1. Unit tests for the new parameters
  2. Validation of per-message training configurations
  3. Tests for fine-grained control over message portion training
  4. Accuracy checks for the mapping between dataset character offsets and tokenized prompts (see the sketch after this list)
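
For a rough sense of what item 4 verifies, here is a minimal sketch of mapping character offsets to token indices using Hugging Face's fast-tokenizer offset mapping. This is illustrative only and is not the PR's actual test code:

```python
from transformers import AutoTokenizer

# Any fast tokenizer supports return_offsets_mapping.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Hello, axolotl!"
enc = tokenizer(text, return_offsets_mapping=True)

# offset_mapping gives each token's (start, end) character span in `text`.
# Suppose we only want to train on the span covering "axolotl":
span_start = text.index("axolotl")
span_end = span_start + len("axolotl")

# Collect the token indices whose character spans overlap the trained span;
# these are the positions whose labels would be kept (not masked out).
train_token_indices = [
    i
    for i, (start, end) in enumerate(enc["offset_mapping"])
    if start < span_end and end > span_start
]
print(train_token_indices)
```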

Types of changes

[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[ ] Documentation update
[ ] Performance enhancement
[ ] Code cleanup or refactor
[ ] Dependency update

Tostino commented 1 month ago

@winglian Just FYI, in the tests, I changed the model because NousResearch/Meta-Llama-3-8B-Instruct has bad tokenizer configurations.

Using that model will cause some of my tests to fail because of an incorrectly configured EOS token.

I mentioned it to them on Discord, so hopefully they fix it and the tests can start passing.

hammoudhasan commented 1 month ago

@Tostino I tried this code and it works alright. However, one thing to fix, imo, is that no error was thrown when I had the key:

`message_field_training: training`

in my config but the field wasn't present in the JSON file. It still tokenized and all.

Tostino commented 2 weeks ago

Just for reference later... @hammoudhasan that is entirely intentional. That field is meant to override whatever other settings you already have, so it lets you set a row to train when it otherwise wouldn't, or not train when it otherwise would. For example, in a conversation where the assistant is learning to correct errors it has made, you don't want to train on the known errors.

So it's not expected to be on every row of training data. Same with `message_field_training_detail`.
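
For illustration, a hypothetical dataset row using the `training` override might look like the following (field names follow the config sketch earlier in this thread). The assistant message containing the known error is excluded from the loss, while the correction is trained on:

```json
{
  "messages": [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 5.", "training": false},
    {"role": "user", "content": "That's wrong. Try again."},
    {"role": "assistant", "content": "You're right, 2 + 2 = 4.", "training": true}
  ]
}
```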