LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieves information dynamically to do so.
https://open-assistant.io
Apache License 2.0
36.92k stars 3.22k forks

Add dataset loader for MegaCodeTraining112k & Evol-Instruct-Code-80k-v1 #3605

Closed: andreaskoepf closed this 1 year ago

andreaskoepf commented 1 year ago

Added code to load rombodawg/MegaCodeTraining112k (key: megacode) and nickrosh/Evol-Instruct-Code-80k-v1 (key: evol_instruct_code). Also added an optional fill_min_length parameter to the InstructionDataset class. If specified, instructions are concatenated until the total string length of prompts and completions exceeds fill_min_length. A seed for the random order can optionally be specified (default: 42).

Example:

  datasets:
    - megacode:
        fill_min_length: 24000
    - evol_instruct_code:
        fill_min_length: 24000
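
For readers unfamiliar with the idea, the concatenation behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the function name `fill_to_min_length`, the `(prompt, completion)` tuple layout, and the decision to keep a shorter trailing sample are all assumptions made for the example.

```python
import random


def fill_to_min_length(pairs, fill_min_length, seed=42):
    """Merge (prompt, completion) pairs into longer multi-turn samples.

    Hypothetical helper, not the Open-Assistant implementation.
    Pairs are visited in a seeded random order and appended to the
    current sample until the summed string length of its prompts and
    completions exceeds fill_min_length.
    """
    rng = random.Random(seed)  # local RNG, so the order is reproducible
    order = list(range(len(pairs)))
    rng.shuffle(order)

    samples = []
    current, current_len = [], 0
    for i in order:
        prompt, completion = pairs[i]
        current.extend([prompt, completion])
        current_len += len(prompt) + len(completion)
        if current_len > fill_min_length:
            samples.append(current)
            current, current_len = [], 0

    if current:
        # Assumption: keep the leftover pairs as a final, shorter sample.
        samples.append(current)
    return samples


if __name__ == "__main__":
    demo_pairs = [("Write a loop." * 3, "for i in range(3): pass" * 3)] * 10
    merged = fill_to_min_length(demo_pairs, fill_min_length=300)
    print(len(merged), [sum(len(s) for s in m) for m in merged])
```

With a large value such as 24000, many short instruction pairs end up packed into a single long sample, which is presumably the motivation for using it with these code datasets.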