Thinking about it, I'd recommend renaming/refactoring `direct_preference_optimization` to be a more general way to collect preference data. `for_direct_preference_optimization` is linking the dataset to a specific algorithm rather than a task. An analogy would be having `for_svm` or `for_naive_bayes` instead of `for_text_classification`. We'll see more and more algorithms for `preference_tuning` or `preference_optimization`, and DPO is just one. I feel like using `preference_tuning` or `preference_optimization` would also be too narrow, because one key use case of preference data is evaluation, but happy to discuss (maybe `preference_optimization` is better than preference collection?).
@dvsrepo I think I forgot to add it to the docs 😅
I agree with you @dvsrepo; however, the difficulty is that it is an iterative process where some things are `required=True` and others are `required=False`, if you know what I mean. That said, I agree we might simplify, but I'm also afraid that some users might not intuitively distinguish between `text2text` and `summarization`/`translation`, or between `preference_modelling` and `dpo`/`ppo`. Hence, I added the specific scenarios and their differences.
@kursathalat can you add the preference modelling one to the docs?
@davidberenstein1957 I see, it's fine but I still think we should improve the preference modeling template to allow setting the desired number of responses (all required) and a ranking question instead of the rating one with two responses.
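For reference, a rough sketch of this built by hand with the current fields and questions (assuming Argilla's 1.x `TextField`/`RankingQuestion` API; the field and question names are just illustrative):

```python
import argilla as rg

# Rough sketch: N required response fields plus a RankingQuestion,
# instead of the current two-response RatingQuestion template.
number_of_responses = 3  # illustrative; this would become a template param

dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="prompt", required=True),
        *[
            rg.TextField(name=f"response-{i}", required=True)
            for i in range(1, number_of_responses + 1)
        ],
    ],
    questions=[
        rg.RankingQuestion(
            name="preference",
            title="Rank the responses from best to worst (ties are allowed)",
            values=[f"response-{i}" for i in range(1, number_of_responses + 1)],
            required=True,
        )
    ],
    guidelines="Rank all responses to the prompt from best to worst.",
)
```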
Yes, I opted for that initially, but post-processing `RankingQuestion`s is a bit unintuitive, and @alvarobartt and I thought binary preference tuning was more common currently. So we can set it up if you think it is needed, but we did make a deliberate choice not to do it that way.
For many RLHF use cases you want to collect rankings of more than 2 responses (as in the InstructGPT paper), as it will give you more chosen/rejected pairs. Allowing for ties is also important.
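As a rough illustration (the helper and the data shape here are made up, not part of Argilla), expanding a ranking of N responses into chosen/rejected pairs while skipping ties could look like this:

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Expand a ranking into (chosen, rejected) pairs, skipping ties.

    `ranked_responses` is a list of (response_text, rank) tuples where a
    lower rank means a better response; equal ranks are ties.
    """
    pairs = []
    for (text_a, rank_a), (text_b, rank_b) in combinations(ranked_responses, 2):
        if rank_a == rank_b:
            continue  # ties produce no preference pair
        chosen, rejected = (text_a, text_b) if rank_a < rank_b else (text_b, text_a)
        pairs.append((chosen, rejected))
    return pairs

# A ranking over 4 responses yields up to 6 pairs, instead of the single
# pair you get from a binary comparison.
print(ranking_to_pairs([("A", 1), ("B", 2), ("C", 2), ("D", 3)]))
```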
@kursathalat,

- `for_direct_preference_optimization` -> `preference_modeling`/`reward_modeling`, like here: https://github.com/argilla-io/argilla/blob/7c697ce54655194f33713c608d3dd79c68e23546/src/argilla/client/feedback/dataset/local/mixins.py#L722
- `for_proximal_policy_optimization` to a `RatingQuestion`: https://github.com/argilla-io/argilla/blob/7c697ce54655194f33713c608d3dd79c68e23546/src/argilla/client/feedback/dataset/local/mixins.py#L731
- `rating_scale` variables to a default of 7
- `number_of_responses` param for the `for_direct_preference_optimization`/`preference_modeling`/`reward_modeling` (see the sketch below)
- `guidelines`
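To make the list above concrete, a sketch of how the resulting calls could look (the signatures are hypothetical; `for_preference_modeling`, `number_of_responses`, and the `rating_scale`/`guidelines` keywords shown here are the requested changes, not the current API):

```python
import argilla as rg

# Hypothetical sketch of the requested template signatures; neither call
# works like this in the current API.

# Renamed preference/reward-modeling template with a configurable number
# of required response fields and a RankingQuestion.
preference_ds = rg.FeedbackDataset.for_preference_modeling(
    number_of_responses=4,  # requested param
    guidelines="Rank all responses to the prompt from best to worst.",
)

# PPO template switched to a RatingQuestion with a default 1-7 scale.
ppo_ds = rg.FeedbackDataset.for_proximal_policy_optimization(
    rating_scale=7,  # requested default
    guidelines="Rate the response to the prompt on a 1-7 scale.",
)
```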
Is your feature request related to a problem? Please describe.
Hi! @kursathalat and @davidberenstein1957, doing a demo today I've realized we don't have a simple way to set up a preference dataset, besides `rg.FeedbackDataset.for_direct_preference_optimization`, which in my opinion is too "narrow" (e.g., you could use it for a reward model trainer or for LLM evaluation) and doesn't use the ranking question. I understand that we have task-focused templates, but I'd like to have more general-purpose datasets too (we will discuss how as we go); in this specific case I suggest creating something like:
Describe the solution you'd like