axolotl-ai-cloud / axolotl


Add flexible configuration options for `chat_template` dataset training #1756

Closed · Tostino closed this 1 month ago

Tostino commented 1 month ago

Motivation and Context

This change is required to provide more granular control over the training process, particularly for fine-tuning models on specific roles and message components. It solves the problem of overly broad training that may not focus on the most relevant parts of the dataset.

These enhancements will enable researchers and developers to create more targeted and efficient training workflows, potentially leading to better model performance on specific tasks or domains.
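
For context, here is a minimal config sketch of how these options might be wired into a `chat_template` dataset entry. Only `message_field_training` and `message_field_training_detail` are named later in this thread; the other field names and values are assumptions for illustration, not confirmed by the PR:

```yaml
datasets:
  - path: ./data/conversations.jsonl   # hypothetical dataset path
    type: chat_template
    # Assumed message-layout keys, shown for illustration only:
    field_messages: messages
    message_field_role: role
    message_field_content: content
    # Per-message training overrides discussed below in this thread:
    message_field_training: training
    message_field_training_detail: train_detail
```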

How has this been tested?

The changes have been tested locally with an enhanced test suite that covers all new functionality, including:

  1. Unit tests for the new parameters
  2. Validation of per-message training configurations
  3. Tests for fine-grained control over message portion training
  4. Accuracy checks for the mapping between dataset character offsets and tokenized prompts (see the sketch after this list)
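
For a rough sense of what item 4 verifies, here is a minimal sketch of mapping character offsets to token indices using Hugging Face's fast-tokenizer offset mapping. This is illustrative only and is not the PR's actual test code:

```python
from transformers import AutoTokenizer

# Any fast tokenizer supports return_offsets_mapping.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Hello, axolotl!"
enc = tokenizer(text, return_offsets_mapping=True)

# offset_mapping gives each token's (start, end) character span in `text`.
# Suppose we only want to train on the span covering "axolotl":
span_start = text.index("axolotl")
span_end = span_start + len("axolotl")

# Collect the token indices whose character spans overlap the trained span;
# these are the positions whose labels would be kept (not masked out).
train_token_indices = [
    i
    for i, (start, end) in enumerate(enc["offset_mapping"])
    if start < span_end and end > span_start
]
print(train_token_indices)
```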

Types of changes

[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[ ] Documentation update
[ ] Performance enhancement
[ ] Code cleanup or refactor
[ ] Dependency update

Tostino commented 1 month ago

@winglian Just FYI, in the tests, I changed the model because NousResearch/Meta-Llama-3-8B-Instruct has bad tokenizer configurations.

Using that model will cause some of my tests to fail because of an incorrectly configured EOS token.

I mentioned it to them on Discord, so hopefully they fix it and the tests can start passing.

hammoudhasan commented 1 month ago

@Tostino I tried this code and it works alright. However, one thing to fix, imo, is that no error was thrown when I had the key:

`message_field_training: training`

in my config but the field wasn't present in the JSON file. It still tokenized and all.

Tostino commented 2 weeks ago

Just for reference later... @hammoudhasan that is entirely intentional. That field is meant to override whatever other settings you already have, so it lets you set a row to train when it otherwise wouldn't, or not train when it otherwise would. For example, in a conversation where the assistant is learning to correct errors it has made, you don't want to train on the known errors.

So it's not expected to be on every row of training data. Same with `message_field_training_detail`.
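
For illustration, a hypothetical dataset row using the `training` override might look like the following (field names follow the config sketch earlier in this thread). The assistant message containing the known error is excluded from the loss, while the correction is trained on:

```json
{
  "messages": [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 5.", "training": false},
    {"role": "user", "content": "That's wrong. Try again."},
    {"role": "assistant", "content": "You're right, 2 + 2 = 4.", "training": true}
  ]
}
```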