Loading an HF dataset with several subsets into a single Argilla dataset overwrites records when the dataset has no row id. The subsets share the same structure, so they could all be loaded into the same Argilla dataset, but users currently have to build the record.id attribute themselves to avoid the overwrites.
Stacktrace and Code to create the bug
import argilla as rg
from datasets import load_dataset

dataset = rg.Dataset(...)
for config in ["gpqa_diamond", "gpqa_extended", "gpqa_main"]:
    hf_ds = load_dataset("some-dataset-with-several-subsets", name=config, token=HF_TOKEN, split="train")
    # row_id + split will be used to build the record id -> records from each subset overwrite the previous ones
    dataset.records.log(hf_ds.map(lambda r: {"subset": config}))
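A possible workaround, until this is handled by the library, is to build a subset-qualified id for each row before logging. The sketch below is only an illustration on plain dicts: `build_record_id` and `add_ids` are hypothetical helper names, the `subset-split-row` id format is an assumption, and the toy rows stand in for the real subsets.

```python
def build_record_id(subset: str, split: str, row_idx: int) -> str:
    """Combine subset, split and row index so ids stay unique across subsets."""
    return f"{subset}-{split}-{row_idx}"


def add_ids(rows, subset, split="train"):
    """Return copies of the rows with a 'subset' field and a subset-qualified 'id'."""
    return [
        {**row, "subset": subset, "id": build_record_id(subset, split, idx)}
        for idx, row in enumerate(rows)
    ]


# Toy stand-ins for the three subsets (hypothetical data, same structure in each).
subsets = {
    "gpqa_diamond": [{"question": "q0"}, {"question": "q1"}],
    "gpqa_extended": [{"question": "q0"}, {"question": "q1"}],
    "gpqa_main": [{"question": "q0"}],
}

records = [rec for name, rows in subsets.items() for rec in add_ids(rows, name)]
ids = [rec["id"] for rec in records]
```

With a real HF dataset, the same idea could presumably be applied through `hf_ds.map(..., with_indices=True)` so that each row gets an `id` column before `dataset.records.log(...)` is called. Whether Argilla picks that column up as the record id without an explicit mapping should be checked against the installed version.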
Expected behavior
Environment:
Additional context