Closed lhoestq closed 2 weeks ago
Hi Quentin, yeah that sounds like a great feature to add!
I'll look into incorporating this sometime this week, don't know why I didn't think of it earlier.
Thanks for checking out Augmentoolkit BTW!
Hey, was looking at trying to add Huggingface Datasets support to Augmentoolkit. Ran into some problems, would you maybe mind helping me resolve them @lhoestq ?
Using the docs I was able to add code that pushes data to the hub. However, currently each push overwrites and deletes the previously-pushed dataset with some new commits.
I'm trying a workaround where I save each dataset as a "split", but my adaptation of the code from the docs is very much not working:
with open(output_file_path, "w") as f:
    existing_files = glob.glob(
        os.path.join(output_dir, "*.yaml")
    )
    for file in existing_files:
        with open(file,'r') as file2:
            file_list_of_dicts = yaml.safe_load(file2)
        # print(file_list_of_dicts)
        sysprompt = {"from": "system", "value": file_list_of_dicts[0]["content"]}
        input = {"from": "human", "value": file_list_of_dicts[-2]["content"]}
        output = {"from": "gpt", "value": file_list_of_dicts[-1]["content"]}
        json_to_write = {"conversations": [sysprompt, input, output]}
        f.write(json.dumps(json_to_write) + "\n")
print("...Converted successfully (we think)")
if os.path.exists(output_file_path):
    dataset = load_dataset("json",data_files=output_file_path)
    dataset.push_to_hub(HUB_PATH, split=directory.split("_")[0], private=PRIVATE,)
Results in:
File "/Users/evan/repos/augmentoolkit/augmentoolkit/control_flow_functions/control_flow_functions.py", line 94, in convert_logging_to_dataset
dataset.push_to_hub(HUB_PATH, split=directory.split("_")[0], private=PRIVATE,)
TypeError: DatasetDict.push_to_hub() got an unexpected keyword argument 'split'
Despite the signature of the .push_to_hub being:
(method) push_to_hub: ((repo_id: Any, config_name: str = "default", set_default: bool | None = None, data_dir: str | None = None, commit_message: str | None = None, commit_description: str | None = None, private: bool | None = False, token: str | None = None, revision: str | None = None, branch: str = "deprecated", create_pr: bool | None = False, max_shard_size: int | str | None = None, num_shards: Dict[str, int] | None = None, embed_external_files: bool = True) -> CommitInfo) | ((repo_id: str, config_name: str = "default", set_default: bool | None = None, split: str | None = None, data_dir: str | None = None, commit_message: str | None = None, commit_description: str | None = None, private: bool | None = False, token: str | None = None, revision: str | None = None, branch: str = "deprecated", create_pr: bool | None = False, max_shard_size: int | str | None = None, num_shards: int | None = None, embed_external_files: bool = True) -> CommitInfo) | Any
The documentation also seems to claim that split= is a valid kwarg:
train_dataset.push_to_hub("<organization>/<dataset_id>", split="train")
val_dataset.push_to_hub("<organization>/<dataset_id>", split="validation")
# later
dataset = load_dataset("<organization>/<dataset_id>")
train_dataset = dataset["train"]
val_dataset = dataset["validation"]
Do you know what's going on here? Appreciate the support!
Hi ! split is a valid argument of Dataset.push_to_hub, while DatasetDict.push_to_hub uploads each dataset in the dictionary as separate splits named after the dictionary keys. You can get a Dataset by requesting the "train" split in load_dataset(..., split="train")
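To make the Dataset vs. DatasetDict distinction concrete, here is a minimal sketch of the snippet from the question adjusted to load a single Dataset before pushing. The helper names (`split_name_from_directory`, `push_as_split`) are made up for illustration and are not part of Augmentoolkit or `datasets`; the sketch also assumes you are already logged in to the Hub and that the JSON Lines file exists.

```python
def split_name_from_directory(directory: str) -> str:
    """Return the text before the first underscore, as in the snippet from the question."""
    return directory.split("_")[0]


def push_as_split(jsonl_path: str, repo_id: str, directory: str, private: bool = True):
    """Load a JSON Lines file as a single Dataset and push it as one named split.

    Hypothetical helper: assumes Hub credentials are already configured.
    """
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    # split="train" makes load_dataset return a Dataset instead of a DatasetDict.
    dataset = load_dataset("json", data_files=jsonl_path, split="train")
    # Dataset.push_to_hub accepts split=, unlike DatasetDict.push_to_hub.
    return dataset.push_to_hub(
        repo_id, split=split_name_from_directory(directory), private=private
    )
```

Without `split="train"`, `load_dataset` returns a DatasetDict with a single "train" key, which is why the original call raised the TypeError.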
Anyway we haven't implemented append-mode in push_to_hub yet, in the meantime you can maybe upload a new Parquet file every time you have new data:
part_nb = directory.split("_")[0]
dataset.to_parquet(f"hf://datasets/{HUB_PATH}/data/train-{part_nb}.parquet")
Would that work for you ? Then when someone reloads the dataset, they would get the data resulting from the concatenation of all the Parquet files
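Putting the two lines above together with the loading step, a rough sketch of the per-part upload could look like this. The function names are hypothetical, and it assumes the target dataset repo already exists, that you are authenticated, and that the installed `datasets` / `huggingface_hub` versions can write to `hf://` paths.

```python
def parquet_part_path(hub_path: str, directory: str) -> str:
    """Build the hf:// destination for one part, following the naming suggested above."""
    part_nb = directory.split("_")[0]
    return f"hf://datasets/{hub_path}/data/train-{part_nb}.parquet"


def upload_part(jsonl_path: str, hub_path: str, directory: str) -> None:
    """Write one JSON Lines file to the Hub repo as its own Parquet file.

    Hypothetical helper: assumes the repo exists and credentials are configured.
    """
    # Imported lazily so parquet_part_path works without `datasets` installed.
    from datasets import load_dataset

    # to_parquet is a Dataset method, so request a single split when loading.
    dataset = load_dataset("json", data_files=jsonl_path, split="train")
    dataset.to_parquet(parquet_part_path(hub_path, directory))
```

Each call adds a new file under data/ rather than replacing the existing ones, as long as the part numbers differ between runs.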
Hey thanks for the explanation! In the most recent commits I've added experimental push to hub functionality. The files are maybe a bit disorganized in the created repo, but they're there!
Cool ! do you have an example repo somewhere I can share with the community ?
Hi team ! I'm Quentin from HF
Great work on this, IMO people are not aware yet of the power of synthetic data and how it can bootstrap any project, so this tool could help a lot :)
Anyway, I was wondering if you had considered letting users save the generated datasets on HF ? If users are ok with sharing the data publicly, it could have a nice impact for the community (note that it's also possible to save private datasets on HF)