allenai / open-instruct

Apache License 2.0

Fix train data script #156

Closed natolambert closed 2 days ago

natolambert commented 2 months ago

Closes #153; makes it so a token isn't needed (you can use huggingface-cli instead). I tested this while retraining OLMo 1.7.
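For context, dropping the explicit token argument works because the Hub libraries fall back to a cached credential. A minimal sketch of the login flow (the cache path may vary across huggingface_hub versions):

```shell
# Log in once interactively; the token is cached locally
# (typically under ~/.cache/huggingface/) and picked up
# automatically by later load_dataset calls.
huggingface-cli login
```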

hamishivi commented 1 month ago

LGTM so long as you've verified the data is the same as our released mixture.

natolambert commented 1 month ago

Code for comparing diffs:

from datasets import load_dataset

def load_and_compare_datasets(dataset_name1, dataset_name2, split='train'):
    # Load both datasets from the Hugging Face Hub
    dataset1 = load_dataset(dataset_name1, split=split)
    dataset2 = load_dataset(dataset_name2, split=split)

    # Index rows by their 'id' field
    dict1 = {row['id']: row for row in dataset1}
    dict2 = {row['id']: row for row in dataset2}

    # Find ids that appear in only one of the two datasets
    unique_ids_in_set1 = set(dict1) - set(dict2)
    unique_ids_in_set2 = set(dict2) - set(dict1)

    # Print the entries unique to each dataset
    print("Entries unique to dataset 1:")
    for example_id in unique_ids_in_set1:
        print(dict1[example_id])

    print("\nEntries unique to dataset 2:")
    for example_id in unique_ids_in_set2:
        print(dict2[example_id])

# Example usage with dataset names and optional split specification
load_and_compare_datasets('ai2-adapt-dev/tulu2-tmp', 'allenai/tulu-v2-sft-mixture', split='train')

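The script above only surfaces ids present in one mixture but not the other; rows that share an id but changed content would slip through. A small extension could also diff those (a sketch; diff_shared_rows and the field name are hypothetical, not part of this PR):

```python
def diff_shared_rows(dict1, dict2, field="messages"):
    """Return ids present in both id-indexed datasets whose `field`
    contents differ (assumes each row is a dict, as produced by the
    dict comprehensions in the comparison script)."""
    shared_ids = set(dict1) & set(dict2)
    return sorted(
        example_id
        for example_id in shared_ids
        if dict1[example_id].get(field) != dict2[example_id].get(field)
    )
```

This catches, e.g., a shared id whose conversation was regenerated between uploads, which a pure set difference on ids would report as "no change".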
natolambert commented 1 month ago

There are minor differences, mostly from WizardLM taking their data down, and possibly from slight changes to the dataset processing script relative to the v2 version on Hugging Face. New backups made:

hamishivi commented 1 month ago

So now the script points to this backup WizardLM version, which yields some small differences when you run the train prep script?

natolambert commented 1 month ago

@hamishivi, yes. The backup was taken from your NFS raw train files. I can get the date it was last modified, but I think multiple things have potentially changed since we uploaded the Tulu v2 SFT mixture.

cbfcbf commented 1 week ago

I found a typo at line 86 of reformat_dataset.py: "if num_few_shot_examples > 0" should be "if num_zero_shot_examples > 0"

hamishivi commented 2 days ago

Okay, I think this is okay to merge. I ran the script through without errors. I'll add a note saying some samples have shifted, pointing to this PR. Since we have the mixture we actually trained on uploaded already, I think this is okay. Also fixed up the bug, thanks!

natolambert commented 2 days ago

Needs the special approval button :)