Nicolas-BZRD / llm-recipes


What does the workflow for data preparation (via llm-distillation framework) and model distillation (llm-recipes) look like? #5

Open hieuchi911 opened 4 weeks ago

hieuchi911 commented 4 weeks ago

Hi, I have a few questions about the workflow that combines llm-recipes with llm-distillation to create a teacher-generated dataset and then distill students on the synthesized datasets.

Q1: Could you confirm whether the steps below (which mostly cover the dataset side of the pipeline) are correct, taking the example where the teacher model is LLaMA2-7B, the student model is EleutherAI/pythia-410m-deduped, and the dataset is FairytaleQA (so the task is generative QA):

  1. Create teacher-generated datasets:
    • run generator.py to create a new dataset with teacher-generated predictions:
      python datasets/generator.py \
      --model_id meta-llama/Llama-2-7b-chat-hf \
      --dataset_id GEM/FairytaleQA \
      --split_name train \
      --number_few_shot 5 \
      --batch_size 4 \
      --bfloat \
      --task qa_generative \
      --mapping llm-distillation/benchmark/mapping/FairytaleQA.json
    • this creates a dataset from FairytaleQA's train split, adding a new column called answer_generated that stores LLaMA2-7B's predictions. The dataset is saved at llm-distillation/datasets/generated/Llama-2-7b-chat-hf/FairytaleQA/train
    • do the same for the validation split of FairytaleQA
  2. Run finetune.py:

    • setting --dataset.file to the path of a script that defines get_split() (which I believe looks exactly like the loader in llm-distillation/datasets/loader/fairytaleQA.py, except that load_from_disk should point to the synthesized dataset at llm-distillation/datasets/generated/Llama-2-7b-chat-hf/FairytaleQA/{split}, where split is the split created by the generator; a rough sketch of what I mean is at the end of this question):
      
      torchrun --standalone --nproc_per_node 2 finetuning.py \
      --model_name EleutherAI/pythia-410m-deduped \
      --enable_fsdp \
      --run_validation False \
      --dataset.file llm-distillation/datasets/loader/fairytaleQA.py \
      --lr 1e-6 \
      --num_epochs 5 \
      --batch_size_training 4 \
      --val_batch_size 4 \
      --output_dir train/output/path \
      --save_step 100 \
      --distillation \
      --distillation_config.model_name EleutherAI/pythia-410m-deduped \
      --distillation_config.enable_fsdp \
      --distillation_config.pure_bf16 \
      --distillation_config.distil_factor 1.5

    
    - when distilling a student model with a teacher, and with `--dataset.file llm-distillation/datasets/loader/fairytaleQA.py`, a separate dataloader is created for each model: each input to the student is a question from the original dataset (without few-shot examples) plus the teacher-generated answer, while each input to the teacher is the few-shot examples, the original question, and the answer the teacher generated.
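
For concreteness, here is a rough sketch of the get_split() loader I have in mind (the column names and prompt formatting are my guesses, not the actual contents of fairytaleQA.py):

```python
# Hypothetical loader sketch: same role as llm-distillation/datasets/loader/fairytaleQA.py,
# but with load_from_disk pointed at the teacher-generated dataset.
# Column names and prompt formatting are my assumptions.
import os
from datasets import load_from_disk

GENERATED_ROOT = "llm-distillation/datasets/generated/Llama-2-7b-chat-hf/FairytaleQA"

def get_split(split):
    # Load the split produced by datasets/generator.py (assumed saved with save_to_disk).
    dataset = load_from_disk(os.path.join(GENERATED_ROOT, split))

    # Build the student input from the original question plus the teacher's
    # prediction stored in the "answer_generated" column added by the generator.
    def to_text(example):
        return {"text": f"Question: {example['question']}\nAnswer: {example['answer_generated']}"}

    return dataset.map(to_text)
```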

Q2: Is it the case that, for any new dataset of a specific task (either a new task or a predefined one like qa, qa_generative, etc.), we should ensure compatibility between the loaders used in finetuning and llm-distillation/prompt/few_shot/task_name.py?

Q3: How did you come up with the few-shot examples for creating the synthetic dataset? Did you compose these examples manually, or take them directly from the train/val/test set?

Q4: For the summarization task, the DialogSum dataset has samples of up to 5k tokens, which I believe can exceed Llama-2-7B's context length of 4096. Did you have to truncate sequences longer than this? I didn't see that handled in llm-distillation.
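
(For reference, a quick way to check this; the dataset id "knkarthick/dialogsum" and the "dialogue" column name are my guesses, and the tokenizer is the gated Llama-2 one:)

```python
# Quick length check (dataset id and column name are assumptions on my part).
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
dataset = load_dataset("knkarthick/dialogsum", split="train")

lengths = [len(tokenizer(sample["dialogue"])["input_ids"]) for sample in dataset]
print(f"max tokens: {max(lengths)}, samples over 4096: {sum(l > 4096 for l in lengths)}")
```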

Thank you!

Nicolas-BZRD commented 2 days ago

Hey @hieuchi911, sorry for the late reply.

Q1: The recipe seems correct 👍

Q2: I’m not sure I understand. You can reuse the few-shot examples we provide for similar tasks. However, if you want to distill a completely new task, you should create your own few-shot examples.

Q3: It depends on the task, but generally, if the dataset is larger than the number of examples used for training, we take rows that are not used for training. If the dataset is smaller, we write the examples ourselves or generate them synthetically with LLaMA or ChatGPT.

Q4: Concerning sequence length, we decided not to truncate. For some models this can be a disadvantage, but we want to let models with a larger context length process the full input. For a “fair” comparison in terms of examples seen during evaluation, we need to keep them and let each model handle them on its own.

hieuchi911 commented 1 day ago

Thank you very much for the clarification. Sorry for not making Q2 clear enough; I'll rephrase it below:

Say I want to do distillation for news summarization with the cnn_dailymail dataset. Would the pipeline below be correct (specifically the loader in the finetuning step)?

For dataset generation:
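
something along these lines, by analogy with the FairytaleQA example above (the --task value and the mapping file path are guesses on my part, and I'm not sure how the cnn_dailymail config, e.g. 3.0.0, should be passed):

```bash
# Guessed by analogy with the FairytaleQA generator command above;
# the --task value and the mapping file for cnn_dailymail are my assumptions.
python datasets/generator.py \
  --model_id meta-llama/Llama-2-7b-chat-hf \
  --dataset_id cnn_dailymail \
  --split_name train \
  --number_few_shot 5 \
  --batch_size 4 \
  --bfloat \
  --task summarization \
  --mapping llm-distillation/benchmark/mapping/cnn_dailymail.json
```

(and the same again for the validation split)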

For finetuning:
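
the same torchrun call as in Q1, with --dataset.file pointing at a loader for the generated cnn_dailymail dataset. The loader path below is hypothetical; I assume it would follow the same get_split()/load_from_disk pattern as my Q1 sketch, pointing at llm-distillation/datasets/generated/Llama-2-7b-chat-hf/cnn_dailymail/{split}:

```bash
# Reuses the Q1 finetuning command; only --dataset.file changes.
# The loader path below is hypothetical.
torchrun --standalone --nproc_per_node 2 finetuning.py \
  --model_name EleutherAI/pythia-410m-deduped \
  --enable_fsdp \
  --run_validation False \
  --dataset.file llm-distillation/datasets/loader/cnn_dailymail.py \
  --lr 1e-6 \
  --num_epochs 5 \
  --batch_size_training 4 \
  --val_batch_size 4 \
  --output_dir train/output/path \
  --save_step 100 \
  --distillation \
  --distillation_config.model_name EleutherAI/pythia-410m-deduped \
  --distillation_config.enable_fsdp \
  --distillation_config.pure_bf16 \
  --distillation_config.distil_factor 1.5
```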