argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0

[FEATURE] Benchmark existing preference tasks (UltraFeedback, UltraJudge, JudgeLM) #117

Open · dvsrepo opened this issue 10 months ago

dvsrepo commented 10 months ago

The idea would be to build and run a benchmark with at least the following datasets: HHH Alignment & MT Bench Human Judgment.

Our current preference tasks are UltraFeedback, UltraJudge, and JudgeLM.

The main idea is to compute the chosen and rejected responses and compare them with the ones in the benchmark. Based on this, we can compute typical classification metrics (accuracy, precision, recall, F1).
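A minimal sketch of what that comparison loop could look like; `judge_preference`, `run_benchmark`, and the example field names are hypothetical placeholders, only the metric computation follows the description above:

```python
# Minimal sketch of the proposed benchmark loop. Each benchmark example is
# assumed to carry a prompt, two candidate responses, and the human label
# ("a" or "b") for the preferred one. `judge_preference` is a hypothetical
# stand-in for running one of our preference tasks with an LLM judge.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def judge_preference(prompt: str, response_a: str, response_b: str) -> str:
    """Hypothetical: run a preference task (e.g. UltraJudge) and return
    "a" or "b" for the response the LLM judge prefers."""
    raise NotImplementedError

def run_benchmark(examples: list[dict]) -> dict:
    human, model = [], []
    for ex in examples:
        human.append(ex["human_label"])  # preference recorded in the benchmark
        model.append(judge_preference(ex["prompt"], ex["response_a"], ex["response_b"]))
    precision, recall, f1, _ = precision_recall_fscore_support(
        human, model, average="binary", pos_label="a"
    )
    return {
        "accuracy": accuracy_score(human, model),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```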

This benchmark will be very useful, as we can run it whenever we develop or integrate new techniques.

zucchini-nlp commented 10 months ago

I ran each of the tasks over both datasets. From the HHH Alignment dataset I took only the "other" subset, and from MT Bench the first 100 questions. The MT Bench dataset also has a "tie" label, so it is more of a multiclass classification problem. As the LLM judge I used ChatGPT.
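For reference, below is a sketch of that data selection with the `datasets` library. The Hub dataset IDs, config names, and split names are my assumptions about the sources used, not confirmed from the original run:

```python
# Sketch of the data selection described above, assuming the HHH Alignment
# and MT Bench human-judgment datasets on the Hugging Face Hub. Dataset IDs,
# config names, and split names are assumptions; adjust to the actual sources.
from datasets import load_dataset

# HHH Alignment: keep only the "other" subset.
hhh = load_dataset("HuggingFaceH4/hhh_alignment", "other", split="test")

# MT Bench human judgments: keep the first 100 examples. Besides preferring
# one of the two answers, annotators could also mark a "tie", so the task
# becomes three-class ("model_a", "model_b", "tie") rather than binary.
mt_bench = load_dataset("lmsys/mt_bench_human_judgments", split="human")
mt_bench = mt_bench.select(range(100))
```

With the extra "tie" class, the binary metric computation from the earlier sketch would need `average="macro"` (or `"weighted"`) in `precision_recall_fscore_support` instead of `average="binary"`.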

In general, HHH Alignment is labelled almost the same way humans label it, while MT Bench has very low agreement with the human judgments. For now I only ran the benchmark to get the metrics; I plan to look into why exactly MT Bench gets such low scores. Below are the results so far:

| Task | Dataset | Accuracy | Recall | Precision | F1 |
|--------------|---------------|----------|--------|-----------|-------|
| Text Quality | HHH Alignment | 0.930 | 0.930 | 1.000 | 0.964 |
| Text Quality | MT Bench | 0.350 | 0.350 | 0.357 | 0.351 |
| UltraJudge | HHH Alignment | 0.744 | 0.744 | 1.000 | 0.853 |
| UltraJudge | MT Bench | 0.290 | 0.290 | 0.325 | 0.292 |
| JudgeLM | HHH Alignment | 0.767 | 0.767 | 1.000 | 0.868 |
| JudgeLM | MT Bench | 0.390 | 0.390 | 0.325 | 0.341 |

dvsrepo commented 10 months ago

This is awesome, @zucchini-nlp!

I think the code to run the benchmark would be a great contribution to the repo!

zucchini-nlp commented 10 months ago

Yes, I opened PR #131.

ashim-mahara commented 2 months ago

@dvsrepo Close as completed?