NVIDIA / NeMo-Aligner

Scalable toolkit for efficient model alignment

feat: support new DPO data format #405

Open arendu opened 1 week ago

arendu commented 1 week ago

What does this PR do ?

This PR makes the DPO dataset use chat-format tokens from the model's config YAML instead of hardcoding chat/special tokens in the JSONL data file.

Currently, each datapoint inside a DPO JSONL data file looks like this:

{
  "prompt": "<extra_id_0>System\n\n<extra_id_1>User\nbacillus subtilus\n<extra_id_1>Assistant\n",
  "chosen_response": "Bacillus ... and industry alike.\n<extra_id_1>",
  "rejected_response": "The Bacillus ... fields of study.\n<extra_id_1>",
  "rejected_reward": 3,
  "chosen_reward": 4
}

With this PR it should look like this (OpenAI list-of-messages format, with no chat/formatting tokens):

{
  "prompt": [
    {
      "role": "system",
      "content": ""
    },
    {
      "role": "user",
      "content": "bacillus subtilus"
    }
  ],
  "chosen_response": {
    "role": "assistant",
    "content": "Bacillus ... and industry alike."
  },
  "rejected_response": {
    "role": "assistant",
    "content": "The Bacillus ... fields of study."
  },
  "chosen_reward": 4,
  "rejected_reward": 3
}
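For reference, here is a minimal sketch of how such messages could be rendered back into the old-style prompt string at load time. The token strings are copied from the example above; in this PR they come from the model's config YAML, and the helper below is purely illustrative, not code added by the PR:

# Illustrative only: token strings are taken from the example above; in this PR
# they would be read from the model's config YAML rather than hardcoded here.
ROLE_HEADERS = {
    "system": "<extra_id_0>System",
    "user": "<extra_id_1>User",
    "assistant": "<extra_id_1>Assistant",
}

def render_prompt(messages: list[dict]) -> str:
    """Join role-tagged messages and leave the prompt open for the assistant turn."""
    out = ""
    for msg in messages:
        out += f"{ROLE_HEADERS[msg['role']]}\n{msg['content']}\n"
    return out + f"{ROLE_HEADERS['assistant']}\n"

prompt = render_prompt([
    {"role": "system", "content": ""},
    {"role": "user", "content": "bacillus subtilus"},
])
# -> "<extra_id_0>System\n\n<extra_id_1>User\nbacillus subtilus\n<extra_id_1>Assistant\n"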

Additionally, a script is included to convert old-format data files into the new format:

python nemo_aligner/data/nlp/scripts/undo_special_tokens.py <path_to_old_format_dpo_jsonl_file>

A new file in the updated format will be written to the same location as the old-format file.
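Roughly, the conversion strips the role-header and end-of-turn tokens and rebuilds each record as a list of messages. Below is a minimal sketch of that idea, assuming the token patterns from the example above; it is an illustration, not the actual undo_special_tokens.py script:

"""Illustrative sketch of converting old-format DPO records to the new
OpenAI-style messages format (not the actual undo_special_tokens.py)."""
import json
import re
import sys

# Turn markers used by the old hardcoded format (taken from the example record).
TURN_RE = re.compile(r"<extra_id_\d+>(System|User|Assistant)\n")

def prompt_to_messages(prompt: str) -> list[dict]:
    """Split an old-format prompt string into role/content messages."""
    parts = TURN_RE.split(prompt)
    # parts looks like ["", "System", "\n", "User", "bacillus subtilus\n", "Assistant", ""]
    messages = [
        {"role": role.lower(), "content": content.strip()}
        for role, content in zip(parts[1::2], parts[2::2])
    ]
    # The trailing empty assistant turn only marks where the response starts.
    if messages and messages[-1]["role"] == "assistant" and not messages[-1]["content"]:
        messages.pop()
    return messages

def response_to_message(response: str) -> dict:
    """Drop the trailing end-of-turn token from a response string."""
    content = re.sub(r"\n<extra_id_\d+>$", "", response)
    return {"role": "assistant", "content": content.strip()}

def convert(record: dict) -> dict:
    return {
        "prompt": prompt_to_messages(record["prompt"]),
        "chosen_response": response_to_message(record["chosen_response"]),
        "rejected_response": response_to_message(record["rejected_response"]),
        "chosen_reward": record["chosen_reward"],
        "rejected_reward": record["rejected_reward"],
    }

if __name__ == "__main__":
    in_path = sys.argv[1]
    out_path = in_path + ".messages.jsonl"  # output name is illustrative
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            fout.write(json.dumps(convert(json.loads(line))) + "\n")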

Changelog

Usage

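As a hedged illustration (the file names and the validation below are assumptions, not part of this PR), after running the conversion script you can sanity-check the converted records before pointing DPO training at them:

# Illustrative usage sketch; file names and checks are assumptions, not part of the PR.
import json

converted_path = "/data/dpo_train_converted.jsonl"  # hypothetical converter output
with open(converted_path) as f:
    for line in f:
        rec = json.loads(line)
        # Prompt is now an OpenAI-style list of role/content messages.
        assert all(m["role"] in {"system", "user", "assistant"} for m in rec["prompt"])
        for key in ("chosen_response", "rejected_response"):
            assert rec[key]["role"] == "assistant"
            # No hardcoded chat/special tokens should remain in the content.
            assert "<extra_id_" not in rec[key]["content"]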

Before your PR is "Ready for review"

Pre checks:

Checklist when contributing a new algorithm

Additional Information

terrykong commented 6 days ago

closing in favor of #403