Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models

Request for Training Code/Config #6

Open X1AOX1A opened 3 days ago

X1AOX1A commented 3 days ago

Hi Conifer Authors:

I have recently been working on reproducing the results of Conifer but noticed a significant discrepancy between my reproduced results and those reported in the paper.

Could you kindly share your training code or training configuration?

Reproduced Results

For the SFT run on Mistral-7B-v0.1, my IFEval loose-prompt score at epoch 4 is 43.80, while the paper reports 50.80. Notably, the epoch-3 checkpoint scores 48.24, so the score drops sharply between epochs 3 and 4.
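For context on the metric: IFEval's prompt-level loose accuracy is the percentage of prompts for which every instruction in the prompt is judged followed under the loose matching criteria. A minimal sketch of that aggregation (the results structure here is hypothetical, not IFEval's actual API):

def prompt_level_loose_accuracy(results):
    # results: one list of booleans per prompt, one flag per instruction (loose judgement)
    return 100.0 * sum(all(flags) for flags in results) / len(results)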

Config

Below is my reproduction configuration. I used LLaMA-Factory and ran the experiment on 8× L40 (48 GB) GPUs.

### model
model_name_or_path: mistralai/Mistral-7B-v0.1
run_name: conifer_mistral_7b_v0.1_sft_full

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: configs/deepspeed/ds_z3_config.json

### dataset
dataset: conifer, sharegpt
template: mistral
cutoff_len: 2048
# max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataset_dir: data

### output
output_dir: saves/sft/conifer/conifer_mistral_7b_v0.1/sft_full
logging_steps: 10
# save_steps: 500
save_strategy: epoch
save_total_limit: 4
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 8
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: False
learning_rate: 2.0e-05
num_train_epochs: 4
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
flash_attn: fa2
ddp_timeout: 180000000
seed: 42
report_to:
 - tensorboard
 - wandb

### eval
eval_dataset: ifeval
per_device_eval_batch_size: 32
eval_strategy: steps
eval_steps: 10
do_sample: False
temperature: 0
top_k: 0
top_p: 0
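For context, a full-parameter LLaMA-Factory SFT config like the one above is typically launched through the llamafactory-cli entry point; the YAML filename below is a placeholder, and FORCE_TORCHRUN=1 forces a torchrun-based multi-GPU launch:

FORCE_TORCHRUN=1 llamafactory-cli train conifer_mistral_sft_full.yaml
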
moyu-hrsun commented 2 days ago

Hi, the hyperparameters you listed in the config look the same as what we used, except for per_device_train_batch_size and gradient_accumulation_steps; that should not matter much, since the global batch size works out the same (8 per-device × 8 accumulation steps × 8 GPUs = 512). We used the alignment-handbook to train the Conifer models, which enables packing=True as its default setting. I'm not sure whether this setting affects the final performance.
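For readers unfamiliar with it, packing concatenates tokenized examples into one long token stream and slices that stream into fixed-length blocks, instead of padding each conversation separately, so batch composition differs from a non-packed run even at the same global batch size. A minimal sketch of the idea (an illustration only, not the alignment-handbook/TRL implementation; the max_len and eos_id values are assumptions):

import itertools

def pack_examples(tokenized_examples, max_len=2048, eos_id=2):
    # Concatenate all tokenized examples into one stream, separating them with EOS.
    stream = list(itertools.chain.from_iterable(ids + [eos_id] for ids in tokenized_examples))
    # Slice the stream into fixed-length blocks; the trailing remainder is dropped.
    return [stream[i:i + max_len] for i in range(0, len(stream) - max_len + 1, max_len)]

# Example: pack_examples([[5, 6, 7], [8, 9], [10]], max_len=4)
# -> [[5, 6, 7, 2], [8, 9, 2, 10]]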

X1AOX1A commented 2 days ago

Thanks for your response. I think there are two key differences between our implementations:

  1. Codebase: I trained with LLaMA-Factory, while you used the alignment-handbook (which enables packing by default).

  2. ShareGPT 53K dataset: I tried to use the ShareGPT 53K dataset referenced in the README, but after applying the filtering steps with the official code I ended up with approximately 90K samples. As a result, I opted for the cleaned ShareGPT 58K dataset from this repository instead.

To confirm whether the discrepancy stems from dataset processing, could you kindly share the exact cleaning script used to obtain the ShareGPT 53K dataset?

moyu-hrsun commented 13 hours ago

The ShareGPT 53K dataset is sampled from anon8231489123/ShareGPT_Vicuna_unfiltered, and we filter out non-English instances using the fastText toolkit (you can obtain the language-identification model from here). Here is a script that reproduces something similar:

import json
import fasttext
import pandas as pd
import random

data_path = "./ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json"  # ShareGPT dump (anon8231489123/ShareGPT_Vicuna_unfiltered)
model_path = "./lid.176.bin"  # the fastText language-identification model

with open(data_path, 'r') as f:
    data = json.load(f)

model = fasttext.load_model(model_path)

messages = []

for sample in data:
    conversations = sample.get('conversations', [])
    if not conversations:
        continue

    temp_message = []

    for conv in conversations:
        # map ShareGPT roles: 'human' becomes user, everything else becomes assistant
        role = 'user' if conv['from'] == 'human' else 'assistant'
        temp_message.append({'role': role, 'content': conv['value']})

    # keep only well-formed dialogues: an even number of turns, strictly alternating user/assistant
    if len(temp_message) % 2 != 0:
        continue
    valid_conversation = all(
        (i % 2 == 0 and mes['role'] == 'user') or (i % 2 != 0 and mes['role'] == 'assistant')
        for i, mes in enumerate(temp_message)
    )
    if not valid_conversation:
        continue

    # language filtering: fastText's predict() rejects strings containing newlines, so flatten them first
    full_sentence = " ".join(mes['content'].replace('\n', ' ') for mes in temp_message)
    lang_label, lang_confidence = model.predict(full_sentence)
    # keep only conversations identified as English with confidence >= 0.6
    if lang_label[0] != '__label__en' or lang_confidence[0] < 0.6:
        continue

    messages.append({'id': sample['id'], 'messages': temp_message})

random.seed(42)
random.shuffle(messages)

df = pd.DataFrame(messages, columns=['id', 'messages'])
print(df)
# save as Parquet (requires pyarrow or fastparquet)
df.to_parquet("./data/train.parquet")
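
One note on the script above: it shuffles with a fixed seed but never subsamples, so it keeps every conversation that survives the filters rather than exactly 53K. Presumably a fixed-size slice is taken after the shuffle; the count below matches the stated dataset size, but the exact sampling step is an assumption:

df = df.head(53_000)  # hypothetical final step before saving: keep the first 53K shuffled rows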