OpenBMB / UltraFeedback

A large-scale, fine-grained, diverse preference dataset (and models).
MIT License

The overall score is not matching with the principles #11

Open ASC-Competition opened 8 months ago

ASC-Competition commented 8 months ago

Hi, I found that some answers with a higher overall_score have a lower helpfulness score in the evol_instruct.jsonl dataset, whose principle is 100% helpfulness.

For example, the scores of the 9th sample in the evol_instruct.jsonl dataset are as follows:

| models | helpfulness | honesty | instruction following | truthfulness | overall score |
|---|---|---|---|---|---|
| gpt-3.5-turbo | 4 | 5 | 4 | 5 | 7 |
| llama-2-70b-chat | 4 | 4 | 5 | 5 | 7.5 |
| mpt-30b-chat | 3 | 4 | 3 | 5 | 6.5 |
| vicuna-33b | 5 | 4 | 4 | 5 | 6.5 |

The answer of vicuna-33b has the highest helpfulness but lowest overall score.
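For reference, the mean of the four principles mentioned below can be computed directly from the scores in the table. This is just an illustrative sketch using the numbers above, not the actual dataset schema:

```python
# Per-model fine-grained scores from the 9th sample above
# (hard-coded for illustration; real records have a different layout).
scores = {
    "gpt-3.5-turbo":    {"helpfulness": 4, "honesty": 5, "instruction_following": 4, "truthfulness": 5},
    "llama-2-70b-chat": {"helpfulness": 4, "honesty": 4, "instruction_following": 5, "truthfulness": 5},
    "mpt-30b-chat":     {"helpfulness": 3, "honesty": 4, "instruction_following": 3, "truthfulness": 5},
    "vicuna-33b":       {"helpfulness": 5, "honesty": 4, "instruction_following": 4, "truthfulness": 5},
}

# Mean of the four fine-grained principles per model.
means = {model: sum(s.values()) / len(s) for model, s in scores.items()}
for model, mean in sorted(means.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {mean}")
```

Note that by this measure three of the four models tie at 4.5, so the mean alone does not reproduce the overall-score ranking either.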

My question is: should I pick the answer with the highest overall score or the highest helpfulness score as the preferred answer, or should I use the mean of the four principles?

Any suggestions would be appreciated, thanks.

lifan-yuan commented 7 months ago

Hi,

Thanks for your interest.

The overall and fine-grained scores are annotated under different schemas and thus may not strictly match each other. Specifically, fine-grained scores are annotated according to our hand-written documentation, while overall scores rely entirely on GPT-4 itself, with the textual critique serving as the CoT rationale for scoring.

We investigated the effects of both kinds of scores in our paper (see Section 4.1) and found that using fine-grained scores was slightly better. But note that those experiments were based on the previous "bugged" version of the overall scores (see this issue), and we are not sure whether the conclusion in the paper still applies to our updated scores.
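If it helps, here is a minimal sketch of building a (chosen, rejected) preference pair from a single record using the mean of the four fine-grained scores. The field names (`completions`, `annotations`, `Rating`, `response`) are assumptions about the record layout for illustration, not the exact dataset schema:

```python
# Sketch: select chosen/rejected completions by mean fine-grained score.
# Field names below are hypothetical, not the exact UltraFeedback schema.
ASPECTS = ["helpfulness", "honesty", "instruction_following", "truthfulness"]

def fine_grained_mean(completion):
    # Average the four per-aspect ratings attached to one completion.
    return sum(int(completion["annotations"][a]["Rating"]) for a in ASPECTS) / len(ASPECTS)

def to_preference_pair(record):
    # Rank completions by mean fine-grained score: the top one becomes
    # "chosen", the bottom one "rejected".
    ranked = sorted(record["completions"], key=fine_grained_mean, reverse=True)
    return ranked[0]["response"], ranked[-1]["response"]

# Toy record with two completions (ratings are made up for the example).
record = {
    "completions": [
        {"response": "answer A",
         "annotations": {a: {"Rating": r} for a, r in zip(ASPECTS, [5, 4, 4, 5])}},
        {"response": "answer B",
         "annotations": {a: {"Rating": r} for a, r in zip(ASPECTS, [3, 4, 3, 5])}},
    ]
}
chosen, rejected = to_preference_pair(record)
```

Ties are possible with only four integer ratings, so in practice you may want a tie-breaker (e.g. the overall score or random choice).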

Hope this helps.