Hi, the hyperparameters you listed in the config look the same as what we used, except for the per_device_train_batch_size and gradient_accumulation_steps, but that may not matter much since the global batch size is the same. We used the alignment-handbook to train the Conifer models, which enables packing=True by default. I'm not sure whether this setting affects the final performance.
Thanks for your response. I think there are two key differences between our implementations:
1. Codebase: I used LLaMAFactory, whereas the Conifer models were trained with the alignment-handbook.
2. ShareGPT53K dataset: I attempted to use the ShareGPT53K dataset referenced in the README. However, after applying the filtering steps with the official code, I ended up with approximately 90K samples, so I opted for the cleaned ShareGPT58K dataset from this repository instead.
To confirm whether the difference stems from dataset processing, could you kindly share the exact cleaning script used to obtain the ShareGPT53K dataset?
The ShareGPT 53K dataset is sampled from anon8231489123/ShareGPT_Vicuna_unfiltered, and we filter out non-English instances using the fastText toolkit (you can obtain the language-identification model from here). Here is a reproduced script that does something similar:
import json
import fasttext
import pandas as pd
import random

data_path = "./ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json"
model_path = "./lid.176.bin"  # The fastText model

with open(data_path, 'r') as f:
    data = json.load(f)

model = fasttext.load_model(model_path)

messages = []
for sample in data:
    conversations = sample.get('conversations', [])
    if not conversations:
        continue

    temp_message = []
    for conv in conversations:
        role = 'user' if conv['from'] == 'human' else 'assistant'
        temp_message.append({'role': role, 'content': conv['value']})

    # chat format
    if len(temp_message) % 2 != 0:
        continue
    valid_conversation = all(
        (i % 2 == 0 and mes['role'] == 'user') or (i % 2 != 0 and mes['role'] == 'assistant')
        for i, mes in enumerate(temp_message)
    )
    if not valid_conversation:
        continue

    # language checking
    full_sentence = " ".join(mes['content'].replace('\n', ' ') for mes in temp_message)
    lang_label, lang_confidence = model.predict(full_sentence)
    if lang_label[0] != '__label__en' or lang_confidence[0] < 0.6:
        continue

    messages.append({'id': sample['id'], 'messages': temp_message})

random.seed(42)
random.shuffle(messages)
df = pd.DataFrame(messages, columns=['id', 'messages'])
print(df)

# save
df.to_parquet("./data/train.parquet")
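As a quick way to compare against the ~90K samples you obtained, you can reload the saved parquet (path taken from the script above) and count how many conversations survive the filtering:

import pandas as pd

# Sanity check: count the conversations retained after filtering.
df = pd.read_parquet("./data/train.parquet")
print(len(df))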
Hi Conifer Authors:
I have recently been working on reproducing the results of Conifer, but I noticed a significant discrepancy between my reproduced results and those reported in the paper.
Could you kindly share your training code or training configuration?
Reproduced Results
For the SFT on Mistral-V0.1, the IFEval-LoosePrompt (epoch 4) result is 43.80, while the paper reports 50.80. Additionally, we observed that the IFEval score for epoch 3 is 48.24, but the score drops significantly in epoch 4.
Config
Below is my reproduction configuration. I used LLaMAFactory and ran the experiment on 8× L40 (48 GB) GPUs.