flairNLP / fabricator

[EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models.
Apache License 2.0

How to properly differentiate between generating unlabeled data and annotating unlabeled data? #11

Closed whoisjones closed 1 year ago

whoisjones commented 1 year ago

Currently, the valid options in our repo are:
(1) Set input_variables but no target_variable, use output_format == "text", and pass no unlabeled_data to the generate function.
(2) Set input_variables and target_variable, use any output_format of your choice, and pass unlabeled_data to the generate function.
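The two options can be written down as one validation step. This is only a minimal sketch and not the repository's actual code: `resolve_mode` and the mode strings "generate" and "annotate" are hypothetical names, and it encodes nothing beyond the two rules stated above.

```python
from typing import Optional, Sequence


def resolve_mode(
    input_variables: Sequence[str],
    output_format: str,
    target_variable: Optional[str] = None,
    unlabeled_data: Optional[Sequence[dict]] = None,
) -> str:
    """Return "generate" for option (1) or "annotate" for option (2).

    Hypothetical helper; raises ValueError for any other combination.
    """
    if not input_variables:
        raise ValueError("input_variables must not be empty")

    if unlabeled_data is None:
        # Option (1): no data is passed in, so the LLM produces new text.
        if target_variable is not None:
            raise ValueError("target_variable requires unlabeled_data")
        if output_format != "text":
            raise ValueError('generation requires output_format == "text"')
        return "generate"

    # Option (2): data is passed in, so a target column must be named.
    if target_variable is None:
        raise ValueError("unlabeled_data requires a target_variable")
    return "annotate"
```

Keeping the rules in one place like this would mean the rest of the code only has to branch on the returned mode.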

Regarding (2), that looks intuitive to me: fill in everything and get your unlabeled data annotated. But (1) requires type checks at various points in the code. One idea might be to split the two tasks to make this easier to understand.
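As a rough illustration of what splitting the tasks could look like, here is a sketch with one small class per task, so each class validates only its own constraints. `GenerationTask` and `AnnotationTask` are hypothetical names made up for this example and are not part of the repo.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GenerationTask:
    """Option (1): produce new unlabeled text; no target, text output only."""

    input_variables: List[str]
    task_description: str
    output_format: str = "text"

    def __post_init__(self) -> None:
        if self.output_format != "text":
            raise ValueError('GenerationTask only supports output_format == "text"')


@dataclass(frozen=True)
class AnnotationTask:
    """Option (2): label existing data; a target variable is mandatory."""

    input_variables: List[str]
    target_variable: str
    output_format: str
    task_description: str

    def __post_init__(self) -> None:
        if not self.target_variable:
            raise ValueError("AnnotationTask needs a target_variable")


# Usage: the class that is chosen states the intent, so no mode has to be inferred.
generate = GenerationTask(["text"], "Generate similar texts.")
annotate = AnnotationTask(
    ["text"],
    "label",
    "single_label_classification",
    "Classify the review whether it's positive or negative",
)
```

With this layout a caller cannot build an annotation task without a target variable, which removes the need for scattered type checks.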

HallerPatrick commented 1 year ago

Just to make sure I understand:

Option 1 is to generate unlabelled data? Like:

input_variables = ["text"] # Column names as they occur in the dataset
output_format = "text" # indicates the output format of the LLM is text
prompt = DataGenerationPrompt(
    input_variables=input_variables,
    output_format=output_format,
    task_description="Generate similar texts.",
)

Option 2 is to annotate unlabelled data? Like:

input_variables = ["text"]  # Column name from dataset
target_variable = "label"  # Also column name from dataset, indicates the variable needs to be annotated
# Annotation format: can be "text", "single_label", "multi_label", or "token_classification";
# it determines how the LLM is prompted for the annotation
output_format = "single_label_classification"
# Map class indices to the label names taken from the few-shot examples
idx2label = dict(enumerate(fewshot_examples.features[target_variable].names))

prompt = DataGenerationPrompt(
    input_variables=input_variables,
    output_format=output_format,
    target_variable=target_variable,
    classification_labels=idx2label,
    task_description="Classify the review whether it's positive or negative",
)

Those checks ensure that we are either generating unlabelled data or annotating it, no?

whoisjones commented 1 year ago

Yes, and the third check below is for when we annotate unlabeled data.