argilla-io / argilla-plugins

🔌 Open-source plugins with practical features for Argilla using listeners.
Apache License 2.0
6 stars 2 forks

[NEW] HF Sync plugin #36

Closed frascuchon closed 5 months ago

frascuchon commented 1 year ago

With this plugin, users can sync data from/to the HF datasets hub and an Argilla instance.

Refs https://github.com/argilla-io/argilla/issues/2614

Description

Since this plugin has several optional configuration parameters, it can cover several use cases. Let's enumerate some of them:

  1. Automatically import an HF dataset into your Argilla instance (by setting the hf_source)
  2. Export all or part of a dataset's records into an HF dataset (by setting the hf_target and rg_dataset parameters)
  3. Sync HF datasets with changes from Argilla (by setting hf_source=hf_target and rg_query=None). Be careful with this, since the whole dataset will be exported every hf_push_to_hub_frequency seconds.
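
As a rough illustration of how the parameters above select a use case, here is a minimal sketch. The `SyncConfig` dataclass and `describe_use_case` helper are hypothetical stand-ins written for this example, not the plugin's actual code. Only the parameter names come from the description, and the default frequency is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyncConfig:
    """Hypothetical stand-in for the plugin configuration (not the real class)."""

    hf_source: Optional[str] = None   # HF dataset to import from
    hf_target: Optional[str] = None   # HF dataset to export to
    rg_dataset: Optional[str] = None  # Argilla dataset name
    rg_query: Optional[str] = None    # optional filter for a partial export
    hf_push_to_hub_frequency: int = 60  # seconds between exports (assumed default)


def describe_use_case(config: SyncConfig) -> str:
    """Map a configuration to one of the use cases enumerated above."""
    if config.hf_source and config.hf_target:
        if config.hf_source == config.hf_target and config.rg_query is None:
            return "sync"  # use case 3: full export back to the source dataset
        return "import+export"
    if config.hf_source:
        return "import"  # use case 1
    if config.hf_target and config.rg_dataset:
        return "export"  # use case 2
    return "unconfigured"
```

For example, `describe_use_case(SyncConfig(hf_source="user/ds", hf_target="user/ds"))` falls into the sync case, where the whole dataset is exported on every cycle.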

dvsrepo commented 1 year ago

Looks good!!

One question:

This looks like a two-way sync like the one we have for Alpaca (exactly what we want for Spaces)

For more general use cases, could/do we also cover the use case where a user just wants to make sure the dataset is stored in a Hub Dataset periodically and not loading the dataset from the Hub back to the Argilla dataset?

(Note: I only skimmed through the code)

frascuchon commented 1 year ago

I've included some in the PR description but, yes. The idea is, by skipping the hf_source the sync will be only in one way from Argilla to HF Dataset hub. The performance of this code is not the best, but some tests that I've done were working fine.
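
To make the one-way case concrete, here is a minimal sketch of a periodic export loop with no import step. It is not the plugin's implementation: `export_periodically`, `load_records` and `push_records` are hypothetical names, and the two callables stand in for reading from Argilla and pushing to the Hub.

```python
import time
from typing import Callable, List, Optional


def export_periodically(
    load_records: Callable[[], List[dict]],
    push_records: Callable[[List[dict]], None],
    frequency_seconds: float,
    max_rounds: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Export everything each cycle; nothing is ever loaded back from the Hub.

    `load_records` and `push_records` are placeholders (assumptions) for the
    Argilla read and the Hub write. Returns the number of completed exports.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        push_records(load_records())  # full export on every cycle
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            sleep(frequency_seconds)  # wait before the next export
    return rounds
```

Because every cycle exports the whole dataset, the cost grows with dataset size, which matches the performance caveat above.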

frascuchon commented 1 year ago

I will try to take a look tomorrow with your feedback.

davidberenstein1957 commented 1 year ago

Additionally, if the code is not robust enough, you could try to set up a retry-backoff with smaller chunk sizes?

Cheers,
David

On 30 Mar 2023, at 18:43, Francisco Aranda wrote: I will try to take a look tomorrow with your feedback.


frascuchon commented 1 year ago

Great! But I think this behavior should be in the core rg.log/rg.load methods
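
For reference, the retry-backoff idea with smaller chunk sizes discussed in this exchange could look roughly like the sketch below. It is only an illustration under assumptions: `push_in_chunks` and `push_chunk` are hypothetical names, and `push_chunk` stands in for whatever call actually sends the records (for example, something built on rg.log).

```python
import time
from typing import Callable, List, Sequence


def push_in_chunks(
    records: Sequence[dict],
    push_chunk: Callable[[Sequence[dict]], None],
    chunk_size: int = 500,
    max_retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Push records chunk by chunk; on failure, back off and retry a smaller chunk.

    Returns the sizes of the chunks that were pushed successfully.
    """
    pushed_sizes: List[int] = []
    start = 0
    size = chunk_size
    failures = 0  # consecutive failures at the current position
    while start < len(records):
        chunk = records[start:start + size]
        try:
            push_chunk(chunk)
        except Exception:
            failures += 1
            if failures > max_retries:
                raise  # give up and surface the last error
            sleep(base_delay * 2 ** (failures - 1))  # exponential backoff
            size = max(1, size // 2)  # retry with half the chunk size
            continue
        pushed_sizes.append(len(chunk))
        start += len(chunk)
        failures = 0  # the reduced chunk size is kept for later chunks
    return pushed_sizes
```

Keeping the reduced chunk size after a success is a design choice of this sketch. A variant could grow the size again once pushes start succeeding.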

frascuchon commented 1 year ago

The plugin now works using environment variables. The variable names match the attribute names of the HuggingfaceSyncConfig class.
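
A minimal sketch of how such a mapping from environment variables to config attributes might work is shown below. `SyncSettings` and `settings_from_env` are hypothetical names for this example, not the real HuggingfaceSyncConfig. The upper-cased variable names (for instance HF_TARGET) and the default frequency are assumptions. Only the attribute names come from this thread.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional

# Attributes that should be parsed as integers (assumption for this sketch).
INT_FIELDS = {"hf_push_to_hub_frequency"}


@dataclass
class SyncSettings:
    """Hypothetical stand-in; attribute names are taken from this thread."""

    hf_source: Optional[str] = None
    hf_target: Optional[str] = None
    rg_dataset: Optional[str] = None
    rg_query: Optional[str] = None
    hf_push_to_hub_frequency: int = 60  # assumed default


def settings_from_env(env: Mapping[str, str] = os.environ) -> SyncSettings:
    """Read each attribute from the upper-cased variable of the same name."""
    values = {}
    for field in fields(SyncSettings):
        raw = env.get(field.name.upper())
        if raw is None:
            continue  # keep the dataclass default
        values[field.name] = int(raw) if field.name in INT_FIELDS else raw
    return SyncSettings(**values)
```

With this sketch, setting only HF_TARGET and RG_DATASET would leave hf_source unset, which corresponds to the one-way export case.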

I've created a space here.

As a TODO that can be tackled in another PR:

Also, it would be great to package the argilla-plugins package with the quickstart image.

dvsrepo commented 1 year ago

ok!

what's left to close this version @frascuchon @davidberenstein1957 ?

dvsrepo commented 1 year ago

Maybe a brief description of how to use it?