argilla-io / argilla

Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
https://docs.argilla.io
Apache License 2.0

Return `FeedbackDataset` URL in `push_to_argilla` and `push_to_huggingface` #3120

Closed alvarobartt closed 1 year ago

alvarobartt commented 1 year ago

Description

As it's considered useful, we could return the URL of a FeedbackDataset that has been pushed either to Argilla or to the HuggingFace Hub, to let the user easily access the dataset and know where it has been pushed.

Solution

Build the URL for both Argilla and the HuggingFace Hub pointing to the FeedbackDataset that was just uploaded to either one of those.

krishnajalan commented 1 year ago

Can I pick this issue? The API_URL will be /api/v1/dataset/, right? (I took that from src/argilla/client/sdk/v1/datasets/api.py.) If not, I will need some time to find all the references to API_URL, so a hint would help.

alvarobartt commented 1 year ago

Feel free to pick this @krishnajalan, just note that the API_URL is the one specified in rg.init(api_url=...) which can be retrieved as rg.active_client().api_url 😄

krishnajalan commented 1 year ago

hey @alvarobartt got Error when accessing rg.active_client().api_url => 'Argilla' object has no attribute 'api_url' but can we use rg.active_client().http_client.base_url ?

alvarobartt commented 1 year ago

True @krishnajalan, feel free to use rg.active_client().http_client.base_url instead, I thought we were setting self.api_url but seems that it's directly injected into the httpx client 👍🏻
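
A minimal sketch of the idea discussed above, assuming the base URL has already been read from `rg.active_client().http_client.base_url`. The helper name `build_dataset_url` and the `/dataset/{id}/annotation-mode` path layout are assumptions for illustration; neither is confirmed in this thread:

```python
from uuid import UUID


def build_dataset_url(base_url: str, dataset_id: UUID) -> str:
    """Join the server base URL and a dataset ID into a UI link.

    Hypothetical helper: the path layout below is an assumption.
    """
    # Strip a trailing slash so the result never contains "//" after the host.
    return f"{str(base_url).rstrip('/')}/dataset/{dataset_id}/annotation-mode"


if __name__ == "__main__":
    example_id = UUID("12345678-1234-5678-1234-567812345678")
    print(build_dataset_url("http://localhost:6900/", example_id))
```

Wrapping `base_url` in `str(...)` keeps the helper usable whether the client exposes the base URL as a plain string or as a URL object.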

krishnajalan commented 1 year ago

hey @alvarobartt some of the UT's are failing. which are unrelated to my changes. did I forget to do some steps ?

    FAILED tests/training/test_span_marker.py::test_evaluate_train_test - TypeError: string indices must be integers
    FAILED tests/training/test_span_marker.py::test_train_no_model - TypeError: string indices must be integers
    FAILED tests/training/test_span_marker.py::test_various_inputs - TypeError: string indices must be integers
    ==== 4 failed, 1074 passed, 39 skipped, 10385 warnings in 710.44s (0:11:50) ====
    Error: Process completed with exit code 1.

alvarobartt commented 1 year ago

Hi @krishnajalan, yes, it seems unrelated, I'll have a look at those, thanks for reporting!

alvarobartt commented 1 year ago

Any update on this @krishnajalan? We'd love to include it in the next Argilla release! 🔥

krishnajalan commented 1 year ago

Yep, I have made the commit and will create the PR and put it up for review today, but the UTs are still failing. Will it be fine if I create the PR with failing UTs?

alvarobartt commented 1 year ago

Thanks @krishnajalan, you can create the PR and the failing unit tests won't matter if unrelated, otherwise, those should pass before merging into develop. But anyway, feel free to create the PR as a draft so that we can review it and help you with the unit tests if needed 😄

dvsrepo commented 1 year ago

Hi! When we first discussed this I was thinking about our previous behavior. If you are pushing data with a script or from a notebook you can easily click the link and go to the dataset.

So this is how I see it:

So it's more about showing info messages than making the methods return the URL (which is also fine I guess but not as useful).

For inspiration, if I recall correctly, wandb shows a nicely formatted table with the links to the run experiment.

alvarobartt commented 1 year ago

@krishnajalan maybe you can have a look at the comment above from @dvsrepo where he shares his thoughts on the next steps to tackle the current issue 😄

krishnajalan commented 1 year ago

will printing the formatted URL work ? print(f"Argilla Dataset URL: {url}") ? it will be clickable but is this the right way ?

dvsrepo commented 1 year ago

This is the strategy we use for rg.log @alvarobartt please confirm we can/should use the same approach

alvarobartt commented 1 year ago

> This is the strategy we use for rg.log @alvarobartt please confirm we can/should use the same approach

Yes, indeed we can use the same approach, not sure about the Failed count, but for the rest feel free to re-use those messages @krishnajalan
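
As a rough sketch of the print-based approach suggested above: the function name `format_push_message` and the exact wording of the message are hypothetical, and the failed count is left out since it was not settled in this thread. Most notebooks and terminals turn a plain `http(s)://` URL in printed output into a clickable link.

```python
def format_push_message(name: str, processed: int, url: str) -> str:
    """Build the info message shown after a dataset push (hypothetical wording)."""
    # Keep the URL on its own line so terminals and notebooks can linkify it.
    return f"{processed} records pushed to dataset '{name}'\nArgilla Dataset URL: {url}"


if __name__ == "__main__":
    print(format_push_message("my-dataset", 3, "http://localhost:6900/dataset/abc"))
```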

davidberenstein1957 commented 1 year ago

@alvarobartt this could be closed right?

dvsrepo commented 1 year ago

is this really done? Could you point me at the PR tackling this specific issue?

alvarobartt commented 1 year ago

Hi @dvsrepo so now we're just returning the RemoteFeedbackDataset i.e. a FeedbackDataset in Argilla, and we have the property url there, so one can do:

remote_dataset = dataset.push_to_argilla(name="my-dataset", workspace="my-workspace")
remote_dataset.url

So as we return the remote object instead, we are not returning the URL, but we can create a mini-PR just to print it out automatically when pushing to Argilla, even though users can also just access remote_dataset.url, WDYT? 😄

davidberenstein1957 commented 1 year ago

@alvarobartt I think both would be good.

dvsrepo commented 1 year ago

@alvarobartt yes, maybe the title/description of the issue was misleading but what I meant is to improve the usability by showing (print) a clickable URL pointing at the dataset just updated/created.

Also, when comparing the previous rg.log with push_to_argilla a few days ago, I noticed that the progress bar of rg.log looks nicer (using Colab and Jupyter notebooks within VS Code). Are we using the same library/function? If we are not using Rich for the new Feedback task progress bars, I think we should.

You can create an issue covering these two enhancements and tag it as good first issue:

  1. Logging meaningful information about the dataset update/creation (at least the clickable URL)
  2. Using Rich if possible for the progress bar

alvarobartt commented 1 year ago

Sure @dvsrepo I'll create those and then close this one in favour of those ones! Thanks for reporting and following up!

dvsrepo commented 1 year ago

Perfect @alvarobartt !

shahdghorsi commented 8 months ago

Hi @alvarobartt I am trying to push a dataset to my argilla endpoint as follows:

ds = rg.FeedbackDataset.from_huggingface("vegeta/testargilla")
ds.push_to_argilla(name="hf-vegeta", workspace="test-workspace")

I am getting the following error, and I am not sure why it reports a deletion-related error:

    Traceback (most recent call last)

    /Users/xxxxxx/site-packages/argilla/client/feedback/dataset/local/mixins.py:214 in __publish_dataset

      211     @staticmethod
      212     def __publish_dataset(client: "httpx.Client", id: UUID) -> None:
      213         try:
    ❱ 214             datasets_api_v1.publish_dataset(client=client, id=id)
      215         except Exception as e:
      216             ArgillaMixin.__delete_dataset(client=client, id=id)
      217             raise Exception(f"Failed while publishing the `FeedbackDataset` in Argilla w

    /Users/xxxxxxx/site-packages/argilla/client/sdk/v1/datasets/api.py:139 in publish_dataset

      136         response_obj = Response.from_httpx_response(response)
      137         response_obj.parsed = FeedbackDatasetModel(response.json())
      138         return response_obj
    ❱ 139     return handle_response_error(response)
      140
      141
      142 def list_datasets(

    /Users/xxxxxx/site-packages/argilla/client/sdk/commons/errors_handler.py:63 in handle_response_error

       60         error_type = GenericApiError
       61     else:
       62         raise HttpResponseError(response=response)
    ❱  63     raise error_type(**error_args)
       64

    ForbiddenApiError: Argilla server returned an error with http status: 403. Error details: {'response': '<!doctype html><meta name=viewport content="width=device-width, initial-scale=1">403403 Forbidden'}

During handling of the above exception, another exception occurred:

    Traceback (most recent call last)

    /Users/xxxxx/site-packages/argilla/client/feedback/dataset/local/mixins.py:92 in __delete_dataset

       89     @staticmethod
       90     def __delete_dataset(client: "httpx.Client", id: UUID) -> None:
       91         try:
    ❱  92             datasets_api_v1.delete_dataset(client=client, id=id)
       93         except Exception as e:
       94             raise Exception(
       95                 f"Failed while deleting the FeedbackDataset with ID '{id}' from Argill

    /Users/xxxxxxxxx/site-packages/argilla/client/sdk/v1/datasets/api.py:113 in delete_dataset

      110
      111     if response.status_code == 200:
      112         return Response.from_httpx_response(response)
    ❱ 113     return handle_response_error(response)
      114
      115
      116 def publish_dataset(

    /Users/xxxxxxx/site-packages/argilla/client/sdk/commons/errors_handler.py:63 in handle_response_error

       60         error_type = GenericApiError
       61     else:
       62         raise HttpResponseError(response=response)
    ❱  63     raise error_type(**error_args)
       64

    ForbiddenApiError: Argilla server returned an error with http status: 403. Error details: {'response': '<!doctype html><meta name=viewport content="width=device-width, initial-scale=1">403403 Forbidden'}

The above exception was the direct cause of the following exception:

    Traceback (most recent call last)

    /Users/xxxxx/site-packages/argilla/client/feedback/dataset/local/mixins.py:258 in push_to_argilla

      255                     vectors_settings=self.vectors_settings, client=httpx_client, id=crea
      256                 )
      257
    ❱ 258             ArgillaMixin.__publish_dataset(client=httpx_client, id=created_dataset.id)
      259
      260             # TODO: Remote dataset should connect all settings by API calls requested on
      261             # Once is done, this prefetch info should be removed.

    /Users/xxxx/site-packages/argilla/client/feedback/dataset/local/mixins.py:216 in __publish_dataset

      213         try:
      214             datasets_api_v1.publish_dataset(client=client, id=id)
      215         except Exception as e:
    ❱ 216             ArgillaMixin.__delete_dataset(client=client, id=id)
      217             raise Exception(f"Failed while publishing the FeedbackDataset in Argilla w
      218
      219     def push_to_argilla(

    /Users/xxxxxx/site-packages/argilla/client/feedback/dataset/local/mixins.py:94 in __delete_dataset

       91         try:
       92             datasets_api_v1.delete_dataset(client=client, id=id)
       93         except Exception as e:
    ❱  94             raise Exception(
       95                 f"Failed while deleting the FeedbackDataset with ID '{id}' from Argill
       96             ) from e
       97

    Exception: Failed while deleting the FeedbackDataset with ID 'xxxxxxx' from Argilla with exception: Argilla server returned an error with http status: 403. Error details: {'response': '<!doctype html>403403 Forbidden'}

alvarobartt commented 8 months ago

Hi @shahdghorsi, thanks for reporting! Could you check whether your Argilla instance is running fine, and that you're using an owner user? The error raised is an HTTP 403, so it may be due to missing permissions. Only owners and workspace admins can create FeedbackDatasets in Argilla, so you may be using a user with an unauthorised role.
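
On why the final message mentions deleting: judging from the traceback above, a failed publish triggers a rollback that deletes the dataset, and when that delete is rejected as well, the delete error is the one that surfaces. The sketch below reproduces that pattern in isolation. All names in it (`ForbiddenError`, `publish`, `delete`, `publish_with_rollback`) are stand-ins rather than Argilla's API, and both calls are hard-coded to fail to mimic a server that answers 403 to everything.

```python
class ForbiddenError(Exception):
    """Stand-in for an HTTP 403 error raised by an API client."""


def publish(dataset_id: str) -> None:
    # Simulates the server rejecting the publish request.
    raise ForbiddenError("http status: 403")


def delete(dataset_id: str) -> None:
    # Simulates the server rejecting the rollback request too.
    raise ForbiddenError("http status: 403")


def publish_with_rollback(dataset_id: str) -> None:
    try:
        publish(dataset_id)
    except Exception as publish_error:
        # Roll back the half-created dataset before reporting the failure.
        try:
            delete(dataset_id)
        except Exception as delete_error:
            # If the rollback is rejected too, this is the error the caller sees.
            raise Exception(
                f"Failed while deleting dataset '{dataset_id}': {delete_error}"
            ) from delete_error
        raise Exception(
            f"Failed while publishing dataset '{dataset_id}': {publish_error}"
        ) from publish_error
```

In this sketch the original publish failure is still reachable through the exception chain, which matches the "During handling of the above exception, another exception occurred" section of the traceback.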

shahdghorsi commented 8 months ago

Hi @alvarobartt, thanks for getting back to me. The Argilla instance is running well actually and the users added are all admins so I am not sure what was wrong

shahdghorsi commented 8 months ago

Hi @alvarobartt I am actually having a problem with this. The Argilla endpoint is behind an IAP and I can connect easily, but when I try to push a dataset from my local machine to a specific workspace I get the following:

```
ForbiddenApiError: Argilla server returned an error with http status: 403. Error details: {'response': '<!doctype html>403403 Forbidden'}
```

I can see the dataset name in the UI when I log in, but it says "0 results found" despite passing data.
The code I am using works perfectly for an instance started on my localhost, but not for the actual endpoint that I want to use.

I am using `rg.log(records, name=argilla_data_name, workspace="test-workspace")` instead of push_to_argilla.

Could you please help?
Thanks,

shahdghorsi commented 8 months ago

Actually, I resolved that error after removing the following part of my code, where I was adding some extra labels. I am not sure what is wrong with it, or why it works without it but not when I add it back:

    settings = rg.TextClassificationSettings(label_schema=set(label_list))

    rg.configure_dataset_settings(name=argilla_data_name,
                                  settings=settings,
                                  workspace="test-workspace")