Thinking about it, I'd recommend renaming/refactoring `direct_preference_optimization` to be a more general way to collect preference data. `for_direct_preference_optimization` is linking the dataset to a specific algorithm rather than a task. An analogy would be having `for_svm` or `for_naive_bayes` instead of `for_text_classification`. We'll see more and more algorithms for `preference_tuning` or `preference_optimization`, and DPO is just one. I feel like using `preference_tuning` or `preference_optimization` would also be too narrow, because one key use case of preference data is evaluation, but happy to discuss (maybe `preference_optimization` is better than preference collection?).
@dvsrepo I think I forgot to add it to the docs 😅
I agree with you @dvsrepo; however, the difficulty is that it is an iterative process where some things are `required=True` and others are `required=False`, if you know what I mean. That said, I agree we might simplify, but I'm also afraid that some users might not intuitively distinguish between `text2text` and `summarization`/`translation`, or between `preference_modelling` and `dpo`/`ppo`. Hence, I added the specific scenarios and their differences.
@kursathalat can you add the preference modelling one to the docs?
@davidberenstein1957 I see, it's fine but I still think we should improve the preference modeling template to allow setting the desired number of responses (all required) and a ranking question instead of the rating one with two responses.
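For reference, a rough sketch of this built by hand with the current fields and questions (assuming Argilla's 1.x `TextField`/`RankingQuestion` API; the field and question names are just illustrative):

```python
import argilla as rg

# Rough sketch: N required response fields plus a RankingQuestion,
# instead of the current two-response RatingQuestion template.
number_of_responses = 3  # illustrative; this would become a template param

dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="prompt", required=True),
        *[
            rg.TextField(name=f"response-{i}", required=True)
            for i in range(1, number_of_responses + 1)
        ],
    ],
    questions=[
        rg.RankingQuestion(
            name="preference",
            title="Rank the responses from best to worst (ties are allowed)",
            values=[f"response-{i}" for i in range(1, number_of_responses + 1)],
            required=True,
        )
    ],
    guidelines="Rank all responses to the prompt from best to worst.",
)
```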
Yes, I opted for that initially, but post-processing `RankingQuestion`s is a bit unintuitive, and @alvarobartt and I thought binary preference tuning was more common currently. So we can set it up if you think it is needed, but we did make a deliberate choice not to do it that way.
For many RLHF use cases you want to collect rankings of more than 2 responses (as in the InstructGPT paper), as it will give you more chosen/rejected pairs. Allowing for ties is also important.
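As a rough illustration (the helper and the data shape here are made up, not part of Argilla), expanding a ranking of N responses into chosen/rejected pairs while skipping ties could look like this:

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Expand a ranking into (chosen, rejected) pairs, skipping ties.

    `ranked_responses` is a list of (response_text, rank) tuples where a
    lower rank means a better response; equal ranks are ties.
    """
    pairs = []
    for (text_a, rank_a), (text_b, rank_b) in combinations(ranked_responses, 2):
        if rank_a == rank_b:
            continue  # ties produce no preference pair
        chosen, rejected = (text_a, text_b) if rank_a < rank_b else (text_b, text_a)
        pairs.append((chosen, rejected))
    return pairs

# A ranking over 4 responses yields up to 6 pairs, instead of the single
# pair you get from a binary comparison.
print(ranking_to_pairs([("A", 1), ("B", 2), ("C", 2), ("D", 3)]))
```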
@kursathalat,

- `for_direct_preference_optimization` -> `preference_modeling`/`reward_modeling`, like here: https://github.com/argilla-io/argilla/blob/7c697ce54655194f33713c608d3dd79c68e23546/src/argilla/client/feedback/dataset/local/mixins.py#L722
- `for_proximal_policy_optimization` to a `RatingQuestion`: https://github.com/argilla-io/argilla/blob/7c697ce54655194f33713c608d3dd79c68e23546/src/argilla/client/feedback/dataset/local/mixins.py#L731
- `rating_scale` variables to a default of 7
- `number_of_responses` param for the `for_direct_preference_optimization`/`preference_modeling`/`reward_modeling` (see the sketch below)
- `guidelines`
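To make the list above concrete, a sketch of how the resulting calls could look (the signatures are hypothetical; `for_preference_modeling`, `number_of_responses`, and the `rating_scale`/`guidelines` keywords shown here are the requested changes, not the current API):

```python
import argilla as rg

# Hypothetical sketch of the requested template signatures; neither call
# works like this in the current API.

# Renamed preference/reward-modeling template with a configurable number
# of required response fields and a RankingQuestion.
preference_ds = rg.FeedbackDataset.for_preference_modeling(
    number_of_responses=4,  # requested param
    guidelines="Rank all responses to the prompt from best to worst.",
)

# PPO template switched to a RatingQuestion with a default 1-7 scale.
ppo_ds = rg.FeedbackDataset.for_proximal_policy_optimization(
    rating_scale=7,  # requested default
    guidelines="Rate the response to the prompt on a 1-7 scale.",
)
```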
Is your feature request related to a problem? Please describe.
Hi! @kursathalat and @davidberenstein1957, doing a demo today I've realized we don't have a simple way to set up a preference dataset, besides `rg.FeedbackDataset.for_direct_preference_optimization`, which in my opinion is too "narrow" (e.g., you could use it for a reward model trainer or for LLM evaluation) and doesn't use the ranking question. I understand that we have task-focused templates, but I'd like to have more general-purpose datasets too (we will discuss how as we go); in this specific case I suggest creating something like:
Describe the solution you'd like