e-p-armstrong / augmentoolkit

Convert Compute And Books Into Instruct-Tuning Datasets (or classifiers)!
MIT License

Save datasets on Hugging Face? #36

Closed lhoestq closed 2 weeks ago

lhoestq commented 1 month ago

Hi team! I'm Quentin from HF.

Great work on this! IMO people aren't yet aware of the power of synthetic data and how it can bootstrap any project, so this tool could help a lot :)

Anyway, I was wondering if you had considered letting users save the generated datasets on HF? If users are ok with sharing the data publicly, it could have a nice impact for the community (note that it's also possible to save private datasets on HF).
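
For a rough idea of what that could look like (just a sketch; the repo id and record fields here are placeholders, not anything Augmentoolkit produces today):

from datasets import Dataset

# placeholder rows standing in for Augmentoolkit's generated conversations
generated_rows = [
    {"conversations": [{"from": "human", "value": "Q"}, {"from": "gpt", "value": "A"}]},
]

dataset = Dataset.from_list(generated_rows)
# private=True keeps the dataset visible only to the uploader (or their org)
dataset.push_to_hub("your-username/augmentoolkit-output", private=True)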

e-p-armstrong commented 1 month ago

Hi Quentin, yeah that sounds like a great feature to add!

I'll look into incorporating this sometime this week; don't know why I didn't think of it earlier.

Thanks for checking out Augmentoolkit BTW!

e-p-armstrong commented 1 month ago

Hey, I was looking at adding Hugging Face Datasets support to Augmentoolkit. I ran into some problems; would you mind helping me resolve them, @lhoestq?

Using the docs I was able to add code that pushes data to the Hub. However, each push currently overwrites the previously-pushed dataset: the new commits replace the old data instead of adding to it.
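
To illustrate (a simplified sketch, with a made-up repo id rather than anything Augmentoolkit actually uses):

from datasets import Dataset

repo_id = "user/augmentoolkit-data"  # hypothetical repo, for illustration only

first_batch = Dataset.from_list([{"text": "row from the first run"}])
first_batch.push_to_hub(repo_id)

second_batch = Dataset.from_list([{"text": "row from the second run"}])
# this push rewrites the data files for the same config/split,
# so the rows from first_batch are gone from the repo afterwards
second_batch.push_to_hub(repo_id)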

I'm trying a workaround where I save each dataset as a "split", but my adaptation of the code from the docs is very much not working:

# imports the snippet relies on
import glob
import json
import os

import yaml
from datasets import load_dataset

with open(output_file_path, "w") as f:
    existing_files = glob.glob(os.path.join(output_dir, "*.yaml"))

    for file in existing_files:
        with open(file, "r") as file2:
            file_list_of_dicts = yaml.safe_load(file2)

        # print(file_list_of_dicts)

        # first message is the system prompt; the last two are the user/assistant turns
        sysprompt = {"from": "system", "value": file_list_of_dicts[0]["content"]}
        input = {"from": "human", "value": file_list_of_dicts[-2]["content"]}
        output = {"from": "gpt", "value": file_list_of_dicts[-1]["content"]}

        json_to_write = {"conversations": [sysprompt, input, output]}

        f.write(json.dumps(json_to_write) + "\n")

print("...Converted successfully (we think)")
if os.path.exists(output_file_path):
    dataset = load_dataset("json", data_files=output_file_path)
    dataset.push_to_hub(HUB_PATH, split=directory.split("_")[0], private=PRIVATE)

Results in:

 File "/Users/evan/repos/augmentoolkit/augmentoolkit/control_flow_functions/control_flow_functions.py", line 94, in convert_logging_to_dataset
    dataset.push_to_hub(HUB_PATH, split=directory.split("_")[0], private=PRIVATE,)
TypeError: DatasetDict.push_to_hub() got an unexpected keyword argument 'split'

Despite the signature of .push_to_hub being:

(method) push_to_hub:
    ((repo_id: Any, config_name: str = "default", set_default: bool | None = None, data_dir: str | None = None, commit_message: str | None = None, commit_description: str | None = None, private: bool | None = False, token: str | None = None, revision: str | None = None, branch: str = "deprecated", create_pr: bool | None = False, max_shard_size: int | str | None = None, num_shards: Dict[str, int] | None = None, embed_external_files: bool = True) -> CommitInfo)
  | ((repo_id: str, config_name: str = "default", set_default: bool | None = None, split: str | None = None, data_dir: str | None = None, commit_message: str | None = None, commit_description: str | None = None, private: bool | None = False, token: str | None = None, revision: str | None = None, branch: str = "deprecated", create_pr: bool | None = False, max_shard_size: int | str | None = None, num_shards: int | None = None, embed_external_files: bool = True) -> CommitInfo)
  | Any

The documentation also seems to claim that split= is a valid kwarg:

train_dataset.push_to_hub("<organization>/<dataset_id>", split="train")
val_dataset.push_to_hub("<organization>/<dataset_id>", split="validation")
# later
dataset = load_dataset("<organization>/<dataset_id>")
train_dataset = dataset["train"]
val_dataset = dataset["validation"]

Do you know what's going on here? Appreciate the support!

lhoestq commented 3 weeks ago

Hi! split is a valid argument of Dataset.push_to_hub, while DatasetDict.push_to_hub uploads each dataset in the dictionary as a separate split named after the dictionary key. You can get a Dataset by requesting the "train" split with load_dataset(..., split="train").
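
For example (sketch; the repo id is a placeholder, and output_file_path is the file from your snippet above):

from datasets import load_dataset

# split="train" returns a Dataset, and Dataset.push_to_hub accepts split=
train_ds = load_dataset("json", data_files=output_file_path, split="train")
train_ds.push_to_hub("username/my-dataset", split="part1", private=True)

# without split=, load_dataset returns a DatasetDict; pushing it uploads each
# key ("train", etc.) as its own split, and split= is not a valid kwarg here
ds_dict = load_dataset("json", data_files=output_file_path)
ds_dict.push_to_hub("username/my-dataset", private=True)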

Anyway, we haven't implemented append mode in push_to_hub yet; in the meantime you can upload a new Parquet file every time you have new data:

part_nb = directory.split("_")[0]
dataset.to_parquet(f"hf://datasets/{HUB_PATH}/data/train-{part_nb}.parquet")

Would that work for you? Then when someone reloads the dataset, they would get the data resulting from the concatenation of all the Parquet files.
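
i.e. reloading would look something like this (sketch, assuming the files are named data/train-*.parquet as above):

from datasets import load_dataset

# every data/train-*.parquet file in the repo is read back as one "train" split
dataset = load_dataset(HUB_PATH, split="train")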

e-p-armstrong commented 2 weeks ago

Hey, thanks for the explanation! In the most recent commits I've added experimental push-to-hub functionality. The files are maybe a bit disorganized in the created repo, but they're there!

lhoestq commented 1 week ago

Cool! Do you have an example repo somewhere that I can share with the community?