argilla-io / argilla

Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
https://docs.argilla.io
Apache License 2.0

[FEATURE] Merging FeedbackDatasets #4984

Open mpjuhasz opened 2 months ago

mpjuhasz commented 2 months ago

Is your feature request related to a problem? Please describe. I'm working with multiple annotators and have given each of them a separate workspace. This results in multiple FeedbackDatasets to aggregate. I want to look at the inter-annotator agreement (IAA), but the metrics described in the docs operate on only one dataset object.

Describe the solution you'd like A method for merging multiple datasets into one would allow users to use the metrics out of the box in cases like the above.
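To make the request concrete, here is a minimal sketch of the kind of merge meant above, written in plain Python rather than against the Argilla API. The `merge_annotations` helper, the `text` key, and the `responses` list are hypothetical stand-ins for illustration, not actual FeedbackDataset attributes; the sketch assumes every annotator's dataset contains the same records, identifiable by a shared field.

```python
def merge_annotations(datasets, key="text"):
    """Combine per-annotator record lists into one list.

    Records that share the same value for `key` are treated as the same
    record, and their responses are concatenated in input order.
    (Hypothetical helper, not part of Argilla.)
    """
    merged = {}
    for records in datasets:
        for record in records:
            record_key = record[key]
            if record_key not in merged:
                merged[record_key] = {key: record_key, "responses": []}
            merged[record_key]["responses"].extend(record["responses"])
    # dicts preserve insertion order, so records keep first-seen order
    return list(merged.values())


# Two annotators labelled the same records in separate workspaces.
annotator_a = [
    {"text": "great product", "responses": [{"user": "a", "label": "positive"}]},
    {"text": "too slow", "responses": [{"user": "a", "label": "negative"}]},
]
annotator_b = [
    {"text": "great product", "responses": [{"user": "b", "label": "positive"}]},
    {"text": "too slow", "responses": [{"user": "b", "label": "neutral"}]},
]

merged = merge_annotations([annotator_a, annotator_b])
```

With all responses for a record collected in one place, agreement metrics could then be computed over a single object instead of one object per annotator.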

Describe alternatives you've considered I've worked around it by converting each dataset to the Hugging Face format, merging those, extracting the config, and pushing everything to the Hugging Face Hub. Using FeedbackDataset.from_huggingface() then yields the required single object. This is rather tedious in the long run, as it requires pushing to and pulling from the Hub for each aggregation.

Additional context N/A

burtenshaw commented 2 months ago

Hi @mpjuhasz

Thanks for the feature suggestion. This sounds like a cool idea. One question I have is:

Also, we have a major release of the SDK in beta right now. In this release (2.0), it will be possible to add the records of one dataset to another if their schemas are compatible. For example:

import argilla_sdk as rg

client = rg.Argilla(
    api_url="https://argilla.example.com",
    api_key="my_token",
)

# fetch both datasets from the Argilla server
dataset_a = client.datasets("dataset_a")
dataset_b = client.datasets("dataset_b")

# add the records of dataset_b to dataset_a (their schemas must be compatible)
dataset_a.records.log(list(dataset_b.records))

We have a blog post on the new release that's coming at the end of the month.

mpjuhasz commented 2 months ago

Hi @burtenshaw,

Thanks for the quick response! My thoughts on the questions:

Looking forward to that release 🙌

burtenshaw commented 2 months ago

@mpjuhasz Nice. With that in mind, the 2.0 release should solve your use case. Let me know if you'd like to try out the experimental version in advance.