argilla-io / argilla

Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
https://docs.argilla.io
Apache License 2.0

[FEATURE] add support for inferring `FeedbackDataset` structure in `from_huggingface` for transformer models #4037

Closed: davidberenstein1957 closed this issue 6 months ago

davidberenstein1957 commented 11 months ago

Is your feature request related to a problem? Please describe.
I would like to focus on HF models.

Describe the solution you'd like
https://huggingface.co/models has models categorized by task:

import argilla as rg

rg.FeedbackDataset.from_huggingface("ProsusAI/finbert")

Internally, something like this should happen, but ideally we should avoid downloading the entire model and just use its config.

import argilla as rg
from transformers import pipeline

# Instantiate a pipeline for the task (this downloads the full model).
name = "sentiment-analysis"
pipe = pipeline(name)

# Derive the dataset structure from the model config.
ds = rg.FeedbackDataset.for_text_classification(
    labels=list(pipe.model.config.id2label.values()),
    multi_label=pipe.model.config.problem_type == "multi_label_classification",
)
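
For reference, a minimal sketch of the config-only variant, assuming transformers' `AutoConfig` (which fetches only config.json from the Hub, not the weights) exposes `id2label` and `problem_type` for the given checkpoint:

import argilla as rg
from transformers import AutoConfig

# Fetch only the model configuration from the Hub; no weights are downloaded.
config = AutoConfig.from_pretrained("ProsusAI/finbert")

# Build the dataset structure from the config alone.
ds = rg.FeedbackDataset.for_text_classification(
    labels=list(config.id2label.values()),
    multi_label=config.problem_type == "multi_label_classification",
)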


alvarobartt commented 11 months ago

Hi @davidberenstein1957, I think that reusing `from_huggingface` not just to load Argilla datasets dumped on the Hugging Face Hub but also to load a configuration for any given model could be confusing to users, and also confusing internally code-wise, so if this is going to happen I think we need to discuss a proper method for doing it. Also, I assume the idea you propose is to re-label already labelled datasets? If you could elaborate more, e.g. on Notion, and share it with the team, that would be great!

davidberenstein1957 commented 11 months ago

Hi @alvarobartt, it is not something that is directly happening or was mentioned anywhere. However, I was just dreaming and thinking a bit: we have gotten a lot of mentions that people don't understand how to use and configure the dataset, so things like the task_templates could help with that. It is not meant to re-label a dataset but rather to easily configure and link them, similar to the reasoning behind using a default embedding_model and text descriptions metadata for datasets.
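
For context, a rough sketch (assuming the Argilla 1.x `FeedbackDataset` API and illustrative labels) of the manual configuration that a task template such as `for_text_classification` spares users from writing by hand:

import argilla as rg

# Manually defining the fields and questions that the
# for_text_classification template would otherwise create for you.
ds = rg.FeedbackDataset(
    fields=[rg.TextField(name="text")],
    questions=[
        rg.LabelQuestion(
            name="label",
            labels=["positive", "negative", "neutral"],
        )
    ],
)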

dvsrepo commented 11 months ago

I agree with @alvarobartt that `from_huggingface` might be confusing. I think this might be better placed in the task templates somehow, but we might also want to look at the bigger picture: associating Hub model IDs with datasets so they can be used in different parts of the product (retraining, inference, etc.).

github-actions[bot] commented 8 months ago

This issue is stale because it has been open for 90 days with no activity.

github-actions[bot] commented 6 months ago

This issue was closed because it has been inactive for 30 days since being marked as stale.