LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.

About pretrain data size in sft-8-datasets. #3320

Closed fengyh3 closed 1 year ago

fengyh3 commented 1 year ago

Hi, I noticed that in the sft-8-datasets config, 5% red_pajama is mixed into the SFT training data. There are 3 questions I am confused about:

  1. Will the pretrain data size end up larger than the instruction data size?
  2. Will this affect the quality of the SFT training?
  3. How did you pick the fraction of pretrain data? (A rough sketch of how I understand the mix is below.)
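
For concreteness, this is roughly how I picture the mix being computed, assuming `fraction` simply subsamples the pretrain dataset (the row counts below are made up, not the real dataset sizes):

```python
# Made-up row counts, only to illustrate the ratio I am asking about.
pretrain_rows = 1_200_000      # hypothetical red_pajama slice
instruction_rows = 400_000     # hypothetical total of all instruction datasets

fraction = 0.05                # the red_pajama fraction I saw in the config
sampled_pretrain = int(pretrain_rows * fraction)

pretrain_share = sampled_pretrain / (sampled_pretrain + instruction_rows)
print(f"pretrain share of the SFT mix: {pretrain_share:.1%}")  # ~13% here
```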
andreaskoepf commented 1 year ago

We ran the longest pre-training for SFT-8, but unfortunately more did not mean better in this case: the eval results were pretty bad, see https://tju01.github.io/ilm-eval/

The idea behind adding red_pajama was to continue basic language-model training. It was also used in the 2nd stage in the hope of reducing overfitting, but the effect was not clear and the end result was "mediocre". A new 2-stage training run without the red_pajama, stack-exchange & prosocial datasets is currently running (now with LoRA).

Just for reference, the old SFT-8 stage-1 dataset config was (bad, don't use):

```yaml
pretrain:
  num_train_epochs: 1
  weight_decay: 0.0
  use_custom_sampler: true
  sort_by_length: false
  datasets:
    - gpteacher_roleplay:
        val_split: 0.05
    - red_pajama:
        fraction: 0.25
        max_val_set: 1000
    - wizardlm_70k:
        val_split: 0.05
        max_val_set: 500
    - joke:
        val_split: 0.05
    - poem_instructions:
        val_split: 0.025
    - oa_stackexchange:
        val_split: 0.05
        fraction: 0.1
        max_val_set: 1000
    - tell_a_joke:
        val_split: 0.05
        max_val_set: 250
    - webgpt:
        val_split: 0.05
        max_val_set: 250
    - gpt4all:
        val_split: 0.01
        max_val_set: 1000
    - alpaca_gpt4:
        val_split: 0.025
        max_val_set: 250
    - code_alpaca:
        val_split: 0.05
        max_val_set: 250
    - vicuna:
        max_val_set: 250
    - oig_file:
        source_url: https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl
        max_count: 10000
        min_length: 250
        val_split: 0.05
        max_val_set: 250
    - minimath:
        val_split: 0.05
    - humaneval_mbpp_codegen_qa:
        val_split: 0.05
    - humaneval_mbpp_testgen_qa:
        val_split: 0.05
    - grade_school_math_instructions:
        val_split: 0.05
    - recipes:
        val_split: 0.05
    - cmu_wiki_qa:
        val_split: 0.05
    - oa_wiki_qa_bart_10000row:
        val_split: 0.05
        max_val_set: 250
    - prosocial_dialogue:
        fraction: 0.1
        max_val_set: 250
    - explain_prosocial:
        fraction: 0.075
        max_val_set: 250
    - soda:
        fraction: 0.25
        max_val_set: 1000
    - oa_leet10k:
        val_split: 0.05
        max_val_set: 250
    - dolly15k:
        val_split: 0.05
        max_val_set: 300
```
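
As a rough mental model (a simplified sketch, not the exact logic in the model_training code; the row count below is hypothetical), the per-dataset keys combine roughly like this:

```python
def effective_sizes(n_rows, fraction=1.0, val_split=0.0, max_val_set=None):
    """Sketch of a common reading of fraction / val_split / max_val_set:
    subsample the dataset, carve out a validation split, cap its size."""
    kept = int(n_rows * fraction)      # keep only this share of the rows
    val = int(kept * val_split)        # validation share of the kept rows
    if max_val_set is not None:
        val = min(val, max_val_set)    # cap the validation set
    return kept - val, val             # (train rows, val rows)

# e.g. the oa_stackexchange entry above, with a hypothetical 200k rows:
train_rows, val_rows = effective_sizes(
    200_000, fraction=0.1, val_split=0.05, max_val_set=1000
)
```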
fengyh3 commented 1 year ago

Thanks for your reply. One more question: is there a good fraction of pretrain data for the SFT training stage? Right now I use 15% pretrain data and 85% instruction-tuning data in SFT training. Is there any suggestion on the ratio between these two parts of the data?
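
For reference, this is how I currently pick the pretrain fraction to hit the 15% target, assuming the fraction just subsamples the pretrain set (row counts are placeholders for my own data):

```python
# Solve target = (f * pretrain_rows) / (f * pretrain_rows + instruction_rows)
# for the subsampling fraction f.
target_share = 0.15
instruction_rows = 400_000      # placeholder: my instruction-tuning rows
pretrain_rows = 1_200_000       # placeholder: my pretrain corpus rows

fraction = target_share / (1 - target_share) * instruction_rows / pretrain_rows
print(f"pretrain fraction to configure: {fraction:.3f}")  # ~0.059 here
```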