Closed · whoisjones closed this 1 year ago
To make it clear for me:
Option 1 is to generate unlabelled data? Like:
```python
input_variables = ["text"]  # Column names as they occur in the dataset
output_format = "text"  # Indicates the output format of the LLM is text

prompt = DataGenerationPrompt(
    input_variables=input_variables,
    output_format=output_format,
    task_description="Generate similar texts.",
)
```
Option 2 is to annotate unlabelled data? Like:
```python
input_variables = ["text"]  # Column name from the dataset
target_variable = "label"  # Also a column name from the dataset; indicates the variable to be annotated
# Annotation format can be "text", "single_label", "multi_label", or "token_classification"
# and determines how the LLM is prompted for the annotation
output_format = "single_label_classification"
idx2label = {idx: key for idx, key in enumerate(fewshot_examples.features[target_variable].names)}

prompt = DataGenerationPrompt(
    input_variables=input_variables,
    output_format=output_format,
    target_variable=target_variable,
    classification_labels=idx2label,
    task_description="Classify the review whether it's positive or negative",
)
```
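For illustration, here is a minimal, self-contained sketch of what the `idx2label` comprehension above evaluates to. The label names are a hypothetical stand-in for `fewshot_examples.features[target_variable].names` (which in Hugging Face `datasets` would come from a `ClassLabel` feature):

```python
# Hypothetical stand-in for fewshot_examples.features["label"].names
label_names = ["negative", "positive"]

# Same comprehension as above: map class index -> label string
idx2label = {idx: key for idx, key in enumerate(label_names)}

print(idx2label)  # {0: 'negative', 1: 'positive'}
```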
Those checks ensure that we are either generating unlabelled data or annotating it, no?
Yes, and the third check below is for when we annotate unlabeled data.
Currently, the valid options in our repo are: (1) set an `input_variable` but no `target_variable`, with `output_format == "text"`, and pass no `unlabeled_data` into the generate function; (2) set `input_variable` + `target_variable`, with any `output_format` of your choice, and pass `unlabeled_data` to the generate function.
Regarding (2), that looks intuitive to me: fill everything in and get your unlabeled data annotated. But (1) requires type checks at various points in the code. One idea might be to split the tasks to make this easier to understand.
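To make the two supported configurations concrete, here is an illustrative sketch of the validation logic as a single function. This is not the repo's actual implementation; the function name, signature, and error messages are assumptions chosen to mirror the rules stated above:

```python
def validate_config(input_variables, output_format, target_variable=None, unlabeled_data=None):
    """Illustrative check for the two valid modes discussed above (not the repo's real code)."""
    if target_variable is None:
        # Mode (1): pure generation -- output_format must be "text", no unlabeled data allowed.
        if output_format != "text":
            raise ValueError("Without a target_variable, output_format must be 'text'.")
        if unlabeled_data is not None:
            raise ValueError("Generation mode does not accept unlabeled_data.")
        return "generate"
    # Mode (2): annotation -- any output_format, but unlabeled_data must be supplied.
    if unlabeled_data is None:
        raise ValueError("Annotation mode (target_variable set) requires unlabeled_data.")
    return "annotate"

# Mode (1): generate unlabelled data
print(validate_config(["text"], "text"))  # generate

# Mode (2): annotate existing unlabelled data
print(validate_config(["text"], "single_label_classification",
                      target_variable="label",
                      unlabeled_data=["a review"]))  # annotate
```

Splitting these two modes into separate classes or entry points, as suggested above, would let each path drop the `None` checks entirely.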