We should clean up and unify all contributions made so far and refactor the main concepts so that they fulfill the following desiderata. The criterion is that it is easy to understand what to do when I want to perform any of the following:
[x] I want to generate texts about a certain topic
[x] I have unlabeled texts and want to classify them into predefined categories, as in text classification
[x] I have unlabeled texts (+ optional labels) and want to generate related texts, as in NLI, summarization, or QA
[x] I have tokens and want to annotate each token with a label, as in named entity recognition
This way I can create any dataset I want (generate texts / tokens from scratch and annotate them).
The following adjustments need to be made (or at least we need to check how they currently work):
[x] Generation: The minimal input to DatasetGenerator is a prompt template + task description. No fewshot examples or unlabeled data required (e.g. Write me news articles.)
[x] Generation: Include label options to generate texts for certain classes (e.g. Generate me a question type about class x for classes in X). It needs to be clear how to control the generated label distribution. I observed it's better not to let the LLM choose what to generate.
[x] Generation: Add fewshot examples: This needs to be combinable with the above, in the sense that our repo automatically iterates over all classes in the label options or in the fewshot examples, so that the prompt is "generate me a question type about class x. here are y examples of class x."
[x] Annotation: Prompt + unlabeled text. We can think of this as plain annotation by instruction-tuned models (e.g. Generate a question to the given context).
[x] Annotation: Prompt + unlabeled text + label options: like the above, but the classes will automatically be included in the prompt.
[x] Annotation: Prompt + unlabeled text + fewshot examples: must be combinable with the above. One now also needs to include the label column from the fewshot dataset.
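
To make the label-iteration idea from the Generation items concrete, here is a minimal sketch in plain Python. It is not the repo's actual API: `build_generation_prompts`, its parameters, and the prompt wording are all hypothetical and only illustrate how the caller, rather than the LLM, could fix the label distribution and attach the fewshot examples of the matching class.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def build_generation_prompts(
    task_description: str,
    label_options: Iterable[str],
    fewshot_examples: Optional[List[Tuple[str, str]]] = None,
    prompts_per_label: int = 1,
) -> List[Tuple[str, str]]:
    """Hypothetical helper: return (label, prompt) pairs.

    The number of prompts per label is set by the caller, so the label
    distribution of the generated data does not depend on the LLM.
    """
    # Group the (text, label) fewshot pairs by label.
    by_label: Dict[str, List[str]] = defaultdict(list)
    for text, label in fewshot_examples or []:
        by_label[label].append(text)

    prompts: List[Tuple[str, str]] = []
    for label in label_options:
        # Fill the class into the task description.
        prompt = task_description.format(label=label)
        examples = by_label.get(label, [])
        if examples:
            # Only attach examples that belong to the current class.
            shots = "\n".join(f"- {t}" for t in examples)
            prompt += f"\nExamples of class {label}:\n{shots}"
        prompts.extend([(label, prompt)] * prompts_per_label)
    return prompts


prompts = build_generation_prompts(
    "Generate a question of type {label}.",
    label_options=["location", "person"],
    fewshot_examples=[("Where is Berlin?", "location")],
    prompts_per_label=2,
)
```

With `prompts_per_label=2`, each class gets exactly two prompts, and only the "location" prompts carry a fewshot example. The same pattern would carry over to annotation by inserting the unlabeled text and the list of label options into the prompt instead.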