natolambert closed 2 days ago
LGTM so long as you've verified the data is the same as our released mixture.
Code for comparing diffs:
from datasets import load_dataset


def load_and_compare_datasets(dataset_name1, dataset_name2, split='train'):
    # Load datasets
    dataset1 = load_dataset(dataset_name1, split=split)
    dataset2 = load_dataset(dataset_name2, split=split)

    # Create dictionaries indexed by 'id'
    dict1 = {row['id']: row for row in dataset1}
    dict2 = {row['id']: row for row in dataset2}

    # Find ids that appear in only one of the two datasets
    unique_ids_to_set1 = set(dict1.keys()) - set(dict2.keys())
    unique_ids_to_set2 = set(dict2.keys()) - set(dict1.keys())

    # Print unique entries
    print("Entries unique to dataset 1:")
    for row_id in unique_ids_to_set1:
        print(dict1[row_id])
    print("\nEntries unique to dataset 2:")
    for row_id in unique_ids_to_set2:
        print(dict2[row_id])


# Example usage with dataset names and optional split specification
load_and_compare_datasets('ai2-adapt-dev/tulu2-tmp', 'allenai/tulu-v2-sft-mixture', split='train')
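The script above only reports ids that exist in one dataset and not the other. If it would help to also catch rows that share an id but whose contents changed (for example from processing-script tweaks), something like the following might work. This is a rough sketch, not part of the PR: `diff_shared_rows` is a hypothetical helper name, it assumes every row is a plain dict with an `id` key, and it runs on toy in-memory rows rather than the real datasets.

```python
def diff_shared_rows(rows1, rows2, key="id"):
    """Return the ids present in both collections whose rows are not equal."""
    dict1 = {row[key]: row for row in rows1}
    dict2 = {row[key]: row for row in rows2}
    # Only ids found on both sides can have "changed" content
    shared_ids = dict1.keys() & dict2.keys()
    return sorted(i for i in shared_ids if dict1[i] != dict2[i])


# Toy rows standing in for the two dataset splits (made-up data)
old_rows = [{"id": "a", "text": "hello"}, {"id": "b", "text": "world"}]
new_rows = [
    {"id": "a", "text": "hello"},
    {"id": "b", "text": "world!"},
    {"id": "c", "text": "extra"},
]

changed_ids = diff_shared_rows(old_rows, new_rows)
print(changed_ids)  # ['b']
```

Rows loaded with `load_dataset` iterate as dicts, so the two splits could presumably be passed in directly, though that has not been tried against these datasets.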
There are minor differences, mostly because wizardlm took its data down, and possibly because of slight changes to the dataset processing script since the v2 version on huggingface. New backups made:
So now the script points to this backup wizardlm version that yields some small differences when you run the train prep script?
@hamishivi, yes. Backup was taken from your nfs raw train files. I can get the date it was last modified, but I think there are multiple things that have potentially changed since we uploaded tulu v2 sft mix
I found a typo at line 86 of reformat_dataset.py: "if num_few_shot_examples > 0" should be "if num_zero_shot_examples > 0".
Okay, I think this is okay to merge. I ran the script through without errors. I'll add a note saying that some samples have shifted, pointing to this PR. Since we have the mixture we actually trained on uploaded already, I think this is okay. Also fixed up the bug, thanks!
Needs the special approval button :)
Closes #153. Makes it so a token isn't needed (you can use huggingface-cli instead). I tested this by using it to retrain OLMo 1.7.