argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0

Introduce JudgeLM (and brief discussion to finalize the structure) #36

Closed · dvsrepo closed this issue 11 months ago

dvsrepo commented 11 months ago

Hi!

We should include JudgeLM. I've been thinking about how to include it with regard to our discussion about the class structure and how to add new approaches to highly similar tasks (e.g., preference).

So this issue is an open discussion with @alvarobartt and @gabrielmbmb to find the right balance (at least for this early release).

Here's the prompt template (untested), config and output:

judgelm.jinja: as you can see, there's no ratings list explaining what a 1 is and what a 10 is.

```jinja
[Question]
{{ instruction }}

{% for response in responses %}
[The Start of Assistant {{ loop.index }}'s Answer]
{{ response }}
[The End of Assistant {{ loop.index }}'s Answer]
{%- endfor %}

[System]
{{ task_description }}
```

PreferenceTask settings:

```python
from textwrap import dedent  # needed for the multi-line description below

task_description = dedent(
    """
    We would like to request your feedback on the performance of two AI assistants in response to the
    user question displayed above.
    Please rate the helpfulness, relevance, accuracy, level of details of their responses. Each assistant
    receives an overall score on a scale of 1 to 10, where a higher score indicates better overall
    performance.
    Please first output a single line containing only two values indicating the scores for Assistant 1 and
    2, respectively. The two scores are separated by a space. In the subsequent line, please provide a
    comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the
    order in which the responses were presented does not affect your judgment.
    """
)
system_prompt = "You are a helpful and precise assistant for checking the quality of the answer."
ratings = None
```
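
For a quick sanity check, here's a minimal rendering sketch (my own illustration, not code from the repo) that wires the template and the settings above together with jinja2. The template string is inlined rather than loaded from judgelm.jinja, the instruction/responses are made-up examples, and it assumes the task_description defined above is in scope:

```python
from textwrap import dedent

from jinja2 import Template

# Inlined copy of judgelm.jinja (normally this would be loaded from the file).
JUDGELM_TEMPLATE = dedent(
    """\
    [Question]
    {{ instruction }}

    {% for response in responses %}
    [The Start of Assistant {{ loop.index }}'s Answer]
    {{ response }}
    [The End of Assistant {{ loop.index }}'s Answer]
    {%- endfor %}

    [System]
    {{ task_description }}"""
)

prompt = Template(JUDGELM_TEMPLATE).render(
    instruction="What is the capital of France?",  # made-up example input
    responses=["Paris.", "The capital of France is Paris, of course."],
    task_description=task_description,  # the dedented string defined above
)
print(prompt)
```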

Output: I think they used a much simpler and cleverer way to generate the responses, with far fewer tokens and faster generation (the UltraFeedback output is bloated).

```text
2,10
Response 1 is a 2 because blah blah and response 2 is a 10 because it's awesome
```

Looking at this, we can't make this template work by reusing MultRatingsTask, because we'd need to rewrite the parse_output function. This means MultRatingsTask is not a good name, since each approach ends up needing its own prompt and parsing.
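
As a rough sketch of what the JudgeLM-specific parsing could look like (my own illustration, not an existing distilabel function; the return shape is just an assumption):

```python
import re
from typing import Any, Dict, List


def parse_judgelm_output(output: str) -> Dict[str, Any]:
    """Hypothetical parser for JudgeLM-style completions.

    Expects a first line with the per-assistant scores (space- or
    comma-separated), followed by the free-text rationale.
    """
    first_line, _, rationale = output.strip().partition("\n")
    ratings: List[float] = [
        float(score) for score in re.split(r"[,\s]+", first_line.strip()) if score
    ]
    return {"ratings": ratings, "rationale": rationale.strip()}


# e.g. parse_judgelm_output("2,10\nResponse 1 is a 2 because ...") returns
# {"ratings": [2.0, 10.0], "rationale": "Response 1 is a 2 because ..."}
```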

Even though I'm not a big fan of this approach, we might need to name them UltraFeedbackRating and JudgeLMRating, both implementing PreferenceTask.

What do you think? Are there other options for the naming or the structure? Otherwise it's fine to go this way for now.


alvarobartt commented 11 months ago

Hey there! Nice idea to add JudgeLM. I think we can have a separate judgelm.py file with JudgeLMRating for the moment and then explore whether there's a better solution (there probably is), so let's explore that before the release!

dvsrepo commented 11 months ago

cool! and let's rename to UltraFeedbackRating too?

dvsrepo commented 11 months ago

@alvarobartt now I see the initial naming alignment: they are *Preference, so UltraFeedbackPreference and JudgeLMPreference 😃

alvarobartt commented 11 months ago

IMO the most consistent naming as of now is probably UltraFeedbackTask and JudgeLMTask? WDYT? We can do UltraFeedbackPreference and JudgeLMPreference otherwise, but since those are already imported from distilabel.tasks.preference, maybe the "preference" part doesn't need to appear in the class name?

dvsrepo commented 11 months ago

Maybe even remove the *Task suffix?

alvarobartt commented 11 months ago

I'd keep it, because we are assigning that to an arg named task, so I feel it's more intuitive to go with task=JudgeLMTask rather than task=JudgeLM. WDYT?

dvsrepo commented 11 months ago

perfect