Roadmap - Githubissues

deep-diver commented 5 months ago

For visibility cc'ing: @sayakpaul

Tech stack

Gemma 2B/7B models
- Hugging Face(HF) transformers for modeling
- HF peft for parameter efficient fine-tuning
- HF text-generation-inference for serving
- HF Hub for versioning fine-tuned models
- HF trl for streamlining SFT fine-tuning process

Pipeline

(manual) Manually launching the LLM fine-tuning
(automate) Evaluating the fine-tuned LLM
(conditional) based on the results from the step 2.
- (automate) if the performance is any better, replace the existing model on the target HF Model Repo
- (manual) if the performance is bad, run a script to generate synthetic data that look similar to the existing ones
  - Go back to the step 1
(automate) from step 3, if the performance is determined to be good enough, the currently serving model by TGI will be replaced accordingly.
- we assume TGI is running on somewhere in the cloud like dstack or HF Space repo with Hugging Chat.

deep-diver commented 5 months ago

Action items

0. dataset preparation

the target dataset is the filtered no-robots dataset with only Coding category.
the no-robots dataset comes with train and test splits, so we are going to use the train split for fine-tuning and test split for evaluating the fine-tuned model.

1. fine-tuning script

this script should look similar to the standard transformers + peft + trl usage (guide link1, link2).
additionally, we could add an option to run fine-tuning process on Vertex AI (guide link).
(optional) running the script on dstack

2. evaluation script

this script leverages service LLM (Gemini, ChatGPT, ...) to evaluate fine-tuned model on the test split.
each item in the test split contains messages which is basically a pair of user and assistant's contents.
- this script let the fine-tuned model generate output of each item.
- then, ask to service LLM how the generated output is similar to the recorded output/answer in 1-10 scale score.
- aggregate scores on all items, then calculate the average score.
this script decides if the average score is any better
- than a given threshold (baseline)
- than the score recorded from the previous model
if yes, this script drops the following outputs (optionally, direct pushing to HF Model repo)
- model checkpoint
- evaluation results (average score, score on individual items)
- (TBD) how to push the model to a HF Model repo should be discussed in detail

3. Synthetic data generation script

this script leverages service LLM (Gemini, ChatGPT, ...) to generate synthetic data and add it to the existing train split. This script shouldn't look into the test split at all, but generate similar data by referring to the train split.

4. Deployment script

this script serves the fine-tuned model on {local, cloud} environment with TGI framework.
- Our primary target cloud environments are HF Space Repo with Hugging Chat template and dstack.
if there is already being served TGI instance, this script will terminate it and replace with the new one (new model)

5. Hooking up all scripts for automation

provide main.py script to run things in one place
provide GitHub Action to control things within GitHub repo

sayakpaul commented 5 months ago

Will start with Step 1.

deep-diver commented 5 months ago

@sayakpaul not step 0 but step 1, right?

sayakpaul commented 5 months ago

I believe you should be able to indendepently work on Step 2? @deep-diver

deep-diver commented 5 months ago

@sayakpaul that is correct!

sayakpaul commented 5 months ago

@sayakpaul not step 0 but step 1, right?

Step 0 will be coupled with Step 1.

deep-diver commented 5 months ago

Cool! Let's do this

sayakpaul commented 5 months ago

Dataset created: https://huggingface.co/datasets/sayakpaul/no_robots_only_coding

from datasets import load_dataset 

dataset = load_dataset("HuggingFaceH4/no_robots")
dataset["train_sft"] = dataset["train_sft"].filter(lambda example: example["category"] == "Coding")
dataset["test_sft"] = dataset["test_sft"].filter(lambda example: example["category"] == "Coding")

dataset.push_to_hub("no_robots_only_coding")

Going to spin up an experiment now.

sayakpaul commented 5 months ago

Used the following config to start the training run with https://github.com/huggingface/alignment-handbook:

# Model arguments
model_name_or_path: google/gemma-7b
model_revision: main
tokenizer_name_or_path: philschmid/gemma-tokenizer-chatml # Custom tokenizer with <|im_start|> and <|im_end|> tokens
torch_dtype: bfloat16
use_flash_attention_2: true

# LoRA arguments
load_in_4bit: true
use_peft: true
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj

# Data training arguments
dataset_mixer:
  sayakpaul/no_robots_only_coding: 1.0
dataset_splits:
- train_sft
- test_sft
preprocessing_num_workers: 12

# SFT trainer config
bf16: true
dataset_kwargs:
  add_special_tokens: false  # We already wrap <bos> and <eos> in the chat template
  append_concat_token: false # No need to add <eos> across samples
do_eval: true
evaluation_strategy: epoch
gradient_accumulation_steps: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: gemma-2b-sft-qlora-no-robots
hub_strategy: every_save
learning_rate: 2.0e-04
log_level: info
logging_steps: 5  
logging_strategy: steps
lr_scheduler_type: cosine
max_seq_length: 2048
max_steps: -1
num_train_epochs: 1
output_dir: data/gemma-2b-sft-qlora-no-robots
overwrite_output_dir: true
per_device_eval_batch_size: 8
per_device_train_batch_size: 4
push_to_hub: true
report_to:
- tensorboard
save_strategy: "steps"
save_steps: 100
save_total_limit: 1
seed: 42
warmup_ratio: 0.1

And then ran (from the root of https://github.com/huggingface/alignment-handbook):

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml scripts/run_sft.py recipes/zephyr-7b-gemma/sft/config_qlora.yaml

Model: https://huggingface.co/sayakpaul/gemma-2b-sft-qlora-no-robots/

It's incorrectly named. Should have been 7b.

deep-diver commented 5 months ago

cool! gonna experiment with that model. in the meantime, I have created a prompt to assess the outputs.

Given an instruction and two responses—one generated by a human and the other by a language model—I'm seeking to evaluate how closely the language model's response mirrors the human-generated one. Additionally, I want to assess the accuracy and relevance of the language model's response to the original instruction. 

Instruction: 
I have 2 lists below. Can you show me how to make a bar chart from them using Python?\n\nlist1 = [\"USA\", \"China\", \"Russia\", \"Brazil\", \"Japan\"]\nlist2 = [22, 53, 32, 45, 56]

Human Response: 
To install it with PIP, open your command line and enter: pip install matplotlib
To install it with Conda, enter: conda install -c conda-forge matplotlib
Then you can import the module into your Python code. Here is the full code:
from matplotlib import pyplot as plt

list1 = ["USA", "China", "Russia", "Brazil", "Japan"]
list2 = [22, 53, 32, 45, 56]

plt.bar(list1, list2)
plt.show()
This will display a bar chart with list1 on the x-axis and list2 on the y-axis.

Language Model Response: 
import matplotlib.pyplot as plt

# Create the bar chart
plt.bar(list1, list2)

# Set labels and title
plt.xlabel("Country")
plt.ylabel("Number of people")
plt.title("Number of people in each country")

# Show the chart
plt.show()

Please analyze the similarity between the Human Response and the Language Model Response on a scale of 1 to 100, where 1 indicates no similarity and 100 indicates identical responses. Then, evaluate the Language Model Response for its precision and relevance in answering the given Instruction on the same 1 to 100 scale.

Provide the assessment in the following JSON format:

{
  “similarity_assessment”: {
    “score”: [Insert similarity score here],
    “reason”: [Insert how the similarity score is determined],
  },
  “precision_assessment”: {
    "score": [Insert precision score here],
    “reason”: [Insert how the precision score is determined],
  }
}

then I get the following output:

{
   “similarity_assessment”: {
      “score”: 80,
      “reason”: The responses include the essential elements of creating a bar chart using 
matplotlib in Python, such as importing the necessary library, specifying the data for the x and y axes, and displaying the chart. The main difference is that the human response provides more detailed installation instructions, while the language model response focuses on the code for generating the bar chart.
   },
   “precision_assessment”: {
      "score": 95,
      “reason”: The language model's response accurately and concisely provides the steps needed to create a bar chart using Python. It includes the relevant code, sets labels and a title for the chart, and displays it. The response demonstrates a good understanding of matplotlib and adheres to the original instructions effectively.
   }
}

If there is something you want to add your thoughts on, please let me know!

sayakpaul commented 5 months ago

Thank you! Will set up a GitHub Action for this. Could you set the Gemini API Key in the secrets?

deep-diver commented 5 months ago

sure @sayakpaul

but, first let me write this in a script. and, the model you shared "https://huggingface.co/sayakpaul/gemma-2b-sft-qlora-no-robots/" is already fine-tuned version?

sayakpaul commented 5 months ago

Yes it is. But the number of samples is incredibly low so the performance might be bad. But I think that is not a blocker.

deep-diver commented 5 months ago

very nice :) I will drop the work on my side tomorrow!

deep-diver commented 5 months ago

@sayakpaul

created a notebook : https://colab.research.google.com/drive/1xvfWWjPNRLubsRcOVvumJXzDcvNrVZiW?usp=sharing

If you confirm it is OK, I will make a PR for this notebook first. Then, we can work on making things as a standalone codebase.

NOTE: the evaluation prompt currently works super bad. The model's output simply repeats the same word Survivor many times, but Gemini assessed it with about 80-90 score

sayakpaul commented 5 months ago

def get_model(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, load_in_4bit=True, device_map="auto"
    )

    return tokenizer, model

I used a different tokenizer. Refer to the comment. We should also follow how chat template is exactly applied to the inputs.

https://github.com/huggingface/alignment-handbook/blob/ba7e0e4fcaa60b8ffa35373de230d5c5cd2eccde/src/alignment/model_utils.py#L65

I will let you figure this out first and then we can take a closer look. WDYT?

deep-diver commented 5 months ago

Sounds cool!

deep-diver commented 5 months ago

@sayakpaul I don't fully understand the second part. Are you saying that get_tokenizer() is preferred way to get tokenizer?

sayakpaul commented 5 months ago

Yes, because that ensures that the same tokenizer is used during training and validation.

deep-diver / llamaduo

Roadmap #4

0. dataset preparation

1. fine-tuning script

2. evaluation script

3. Synthetic data generation script

4. Deployment script

5. Hooking up all scripts for automation