argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0

[FEATURE] Benchmark existing preference tasks (UltraFeedback, UltraJudge, JudgeLM) #117

Open · dvsrepo opened this issue 10 months ago

dvsrepo commented 10 months ago

The idea would be to build and run a benchmark with at least the following datasets: HHH Alignment & MT Bench Human Judgment.

Our current preference tasks are UltraFeedback, UltraJudge, and JudgeLM.

The main idea is to compute the chosen and rejected responses and compare them with the ones in the benchmark. Based on this, we can compute typical classification metrics (accuracy, precision, recall, F1).
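A minimal sketch of what that comparison loop could look like; `judge_preference`, `run_benchmark`, and the example field names are hypothetical placeholders, only the metric computation follows the description above:

```python
# Minimal sketch of the proposed benchmark loop. Each benchmark example is
# assumed to carry a prompt, two candidate responses, and the human label
# ("a" or "b") for the preferred one. `judge_preference` is a hypothetical
# stand-in for running one of our preference tasks with an LLM judge.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def judge_preference(prompt: str, response_a: str, response_b: str) -> str:
    """Hypothetical: run a preference task (e.g. UltraJudge) and return
    "a" or "b" for the response the LLM judge prefers."""
    raise NotImplementedError

def run_benchmark(examples: list[dict]) -> dict:
    human, model = [], []
    for ex in examples:
        human.append(ex["human_label"])  # preference recorded in the benchmark
        model.append(judge_preference(ex["prompt"], ex["response_a"], ex["response_b"]))
    precision, recall, f1, _ = precision_recall_fscore_support(
        human, model, average="binary", pos_label="a"
    )
    return {
        "accuracy": accuracy_score(human, model),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```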

This benchmark will be very useful, as we can run it whenever we develop or integrate new techniques.

zucchini-nlp commented 10 months ago

I ran each of the tasks over both datasets. From the HHH Alignment dataset I took only the "other" subset, and from MT Bench the first 100 questions. The MT Bench dataset also has a "tie" label, so it is more of a multiclass classification problem. As the LLM judge I used ChatGPT.
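For reference, below is a sketch of that data selection with the `datasets` library. The Hub dataset IDs, config names, and split names are my assumptions about the sources used, not confirmed from the original run:

```python
# Sketch of the data selection described above, assuming the HHH Alignment
# and MT Bench human-judgment datasets on the Hugging Face Hub. Dataset IDs,
# config names, and split names are assumptions; adjust to the actual sources.
from datasets import load_dataset

# HHH Alignment: keep only the "other" subset.
hhh = load_dataset("HuggingFaceH4/hhh_alignment", "other", split="test")

# MT Bench human judgments: keep the first 100 examples. Besides preferring
# one of the two answers, annotators could also mark a "tie", so the task
# becomes three-class ("model_a", "model_b", "tie") rather than binary.
mt_bench = load_dataset("lmsys/mt_bench_human_judgments", split="human")
mt_bench = mt_bench.select(range(100))
```

With the extra "tie" class, the binary metric computation from the earlier sketch would need `average="macro"` (or `"weighted"`) in `precision_recall_fscore_support` instead of `average="binary"`.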

In general, HHH Alignment is labelled almost the same way humans label it, while MT Bench has very low agreement with the human judgments. For now I only ran the benchmark to get the metrics; I plan to look into why exactly MT Bench gets such low scores. Below are the results so far:

| Task | Dataset | Accuracy | Recall | Precision | F1 |
|--------------|---------------|----------|--------|-----------|-------|
| Text Quality | HHH Alignment | 0.930 | 0.930 | 1.000 | 0.964 |
| Text Quality | MT Bench | 0.350 | 0.350 | 0.357 | 0.351 |
| UltraJudge | HHH Alignment | 0.744 | 0.744 | 1.000 | 0.853 |
| UltraJudge | MT Bench | 0.290 | 0.290 | 0.325 | 0.292 |
| JudgeLM | HHH Alignment | 0.767 | 0.767 | 1.000 | 0.868 |
| JudgeLM | MT Bench | 0.390 | 0.390 | 0.325 | 0.341 |

dvsrepo commented 10 months ago

This is awesome, @zucchini-nlp!

I think the code to run the benchmark would be a great contribution to the repo!

zucchini-nlp commented 10 months ago

Yes, I opened PR #131.

ashim-mahara commented 2 months ago

@dvsrepo Close as completed?