argilla-io / argilla

Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
https://docs.argilla.io
Apache License 2.0
3.91k stars 367 forks

[FEATURE] add method `rg.list_datasets()` #3250

g-insana closed 1 year ago

g-insana commented 1 year ago

Is your feature request related to a problem? Please describe.

I believe the ability to query the datasets available in a workspace would be a useful addition to the client API. For example:

import argilla as rg
rg.init()
datasets = rg.list_datasets()

Although this is already possible with datasets = rg.active_client().http_client.get("/api/datasets"), I believe a dedicated method would be more intuitive.

Describe the solution you'd like

In keeping with other API calls, it would probably make sense to add optional arguments such as workspace (Optional[str]).
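To make the request concrete, here is a minimal sketch of what such a helper could look like if it simply wrapped the existing workaround. The function name, the `http_client` parameter, and the assumption that each returned record is a dict carrying a `workspace` key are all hypothetical and are not part of the argilla API; `FakeClient` only stands in for `rg.active_client().http_client` so the sketch runs without a server.

```python
from typing import Any, Dict, List, Optional


def list_datasets(http_client: Any, workspace: Optional[str] = None) -> List[Dict[str, Any]]:
    """Hypothetical helper, not an existing argilla function."""
    # Same call as the workaround quoted in this issue.
    datasets = http_client.get("/api/datasets")
    if workspace is None:
        return list(datasets)
    # Assumption: every record exposes the workspace it belongs to.
    return [d for d in datasets if d.get("workspace") == workspace]


class FakeClient:
    """Stand-in for rg.active_client().http_client, for illustration only."""

    def get(self, path: str) -> List[Dict[str, Any]]:
        # Made-up records; only the 'name' and 'task' keys appear in the issue.
        return [
            {"name": "reviews", "task": "TextClassification", "workspace": "team-a"},
            {"name": "entities", "task": "TokenClassification", "workspace": "team-b"},
        ]


names = [d["name"] for d in list_datasets(FakeClient(), workspace="team-a")]
print(names)
```

Filtering on the client side keeps the sketch independent of whatever query parameters the server may or may not accept.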

Online documentation should list this method and suggest usage examples, like the following:

import pandas as pd

datasets = rg.list_datasets()  # proposed method
dataset_names = [dataset['name'] for dataset in datasets]
datasets_df = pd.DataFrame.from_dict(datasets).set_index('name')
display(datasets_df[['task', 'last_updated']])  # display() is available in Jupyter notebooks

Describe alternatives you've considered

For now, using rg.active_client().http_client.get("/api/datasets")

Additional context

The request arose while working on code to ensure persistence of an Argilla Space on Hugging Face (hf): periodically checking whether the data has been modified, dumping datasets to local disk, and restoring them on hf if the Space has been rebooted (and has therefore lost all data).
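The "has the data been modified" check described here could be sketched as a comparison of `last_updated` values between two listings. This is only an illustration under stated assumptions: the function names are hypothetical, the sample timestamps are made up, and the records are assumed to be dicts with the `name` and `last_updated` keys used elsewhere in this issue.

```python
from typing import Dict, Iterable, List, Mapping


def snapshot(listing: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Map each dataset name to its last_updated value (hypothetical helper)."""
    return {record["name"]: record["last_updated"] for record in listing}


def changed_datasets(previous: Mapping[str, str],
                     listing: Iterable[Mapping[str, str]]) -> List[str]:
    """Return names of datasets that are new or whose last_updated differs."""
    changed = []
    for record in listing:
        if previous.get(record["name"]) != record["last_updated"]:
            changed.append(record["name"])
    return changed


# Made-up listings standing in for two consecutive polls of the server.
first_poll = [
    {"name": "reviews", "last_updated": "2023-06-22T10:00:00"},
    {"name": "entities", "last_updated": "2023-06-22T10:05:00"},
]
second_poll = [
    {"name": "reviews", "last_updated": "2023-06-22T11:30:00"},   # modified
    {"name": "entities", "last_updated": "2023-06-22T10:05:00"},  # unchanged
    {"name": "summaries", "last_updated": "2023-06-22T11:00:00"}, # new
]

to_dump = changed_datasets(snapshot(first_poll), second_poll)
print(to_dump)
```

Only the datasets in `to_dump` would then need a fresh local dump; an empty listing after a previously non-empty one could be treated as a sign that the Space was rebooted.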

First discussed in the Slack channel #02-support-and-questions on Jun 22nd, 2023. Invited by @davidberenstein1957 to submit this as an issue.

davidberenstein1957 commented 1 year ago

@alvarobartt, did you want to include this during the workspace management efforts too?

alvarobartt commented 1 year ago

Not yet, but we can certainly tackle it in the next release. The only consideration is that we're using two different APIs: one for the TextClassification, TokenClassification, and Text2Text datasets, and another (v1) for the FeedbackDataset. We need to evaluate whether it's worth merging both outputs or just providing different functions.
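As one way to picture the "merge both outputs" option, the sketch below combines two listings into a single list and tags each record with the API it came from. It is not how argilla implements anything: the function name, the `api` key, and the sample records are assumptions made for illustration, and both APIs are assumed to return lists of dicts with a `name` key.

```python
from typing import Any, Dict, List


def merge_listings(legacy: List[Dict[str, Any]],
                   feedback: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Combine both listings, recording which API each record came from."""
    merged = [{**record, "api": "legacy"} for record in legacy]
    merged += [{**record, "api": "v1"} for record in feedback]
    # Sort by name so the combined listing has a stable order.
    return sorted(merged, key=lambda record: record["name"])


# Made-up records standing in for the responses of the two APIs.
legacy_listing = [{"name": "reviews", "task": "TextClassification"}]
feedback_listing = [{"name": "annotations"}]

combined = merge_listings(legacy_listing, feedback_listing)
print([(record["name"], record["api"]) for record in combined])
```

Keeping the source API on each record would let callers still tell the two dataset families apart; the alternative mentioned above, separate functions, avoids having to reconcile differing record shapes at all.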