StacklokLabs / promptwright

Generate large synthetic data using a local LLM
Apache License 2.0

Add tag and template dataset card when pushing to Hub #8

Open davanstrien opened 2 weeks ago

davanstrien commented 2 weeks ago

It's great to see a library focusing on using local LLMs for synthetic data generation!

When pushing a dataset with push_to_hub, it would be nice to add a template dataset card, or at least some tags, so that datasets created with promptwright are easier to find. Tags could be added with something like:

from huggingface_hub import DatasetCard

def update_card_with_tags(repo_id: str):
    card = DatasetCard.load(repo_id)
    # Initialize tags if not a list
    if not isinstance(card.data.tags, list):
        card.data.tags = []
    # Add tags if they don't already exist
    tags_to_add = ["promptwright", "synthetic"]
    for tag in tags_to_add:
        if tag not in card.data.tags:
            card.data.tags.append(tag)
    card.push_to_hub(repo_id)

would already help with discoverability. Example repo with these tags: https://huggingface.co/datasets/davanstrien/promptwright-test. You could also add a more expansive dataset card using a template in the future. There is a nice example from distilabel of this kind of template: https://github.com/argilla-io/distilabel/blob/main/src/distilabel/utils/card/distilabel_template.md
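To make the template idea a bit more concrete, here is a minimal sketch that uses only the standard library. It merges the discoverability tags into any existing ones and renders a small card with YAML front matter, which is where the Hub reads dataset tags from. The template text, the function names merge_tags and render_card, and the title and model_name fields are all hypothetical placeholders, not part of promptwright or huggingface_hub.

```python
from string import Template
from typing import Iterable, List, Optional

# Hypothetical template text (an assumption for illustration);
# a real template would likely carry many more sections.
CARD_TEMPLATE = Template("""\
---
tags:
$tag_lines
---

# $title

This dataset was generated with promptwright using the model `$model_name`.
""")


def merge_tags(existing: Optional[Iterable[str]], to_add: Iterable[str]) -> List[str]:
    """Return existing tags plus new ones, keeping order and skipping duplicates."""
    # Mirror the snippet above: anything that is not a list/tuple is treated as "no tags".
    merged = list(existing) if isinstance(existing, (list, tuple)) else []
    for tag in to_add:
        if tag not in merged:
            merged.append(tag)
    return merged


def render_card(
    title: str,
    model_name: str,
    existing_tags: Optional[Iterable[str]] = None,
) -> str:
    """Render README text with YAML front matter holding the merged tags."""
    tags = merge_tags(existing_tags, ["promptwright", "synthetic"])
    tag_lines = "\n".join(f"- {tag}" for tag in tags)
    return CARD_TEMPLATE.substitute(
        title=title, model_name=model_name, tag_lines=tag_lines
    )


if __name__ == "__main__":
    # Hypothetical values, just to show the rendered card.
    print(render_card("Demo dataset", "some-local-model"))
```

The rendered string could then presumably be wrapped with DatasetCard(content) and pushed in the same way as in the snippet above. Keeping the tag merging and rendering free of network calls makes them easy to unit test.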

lukehinds commented 2 weeks ago

Great idea, thanks @davanstrien, will take a look!

davanstrien commented 2 weeks ago

> Great idea, thanks @davanstrien, will take a look!

Awesome! Happy to review if useful :)

lukehinds commented 1 week ago

Hey @davanstrien, I was thinking this would be a nice addition for you to contribute, if you're up for it? If time is scarce, I'm happy to pick it up. You seem to have figured out what needs doing anyhow, so it makes sense for you to ship this.