OpenBMB / UltraFeedback

A large-scale, fine-grained, diverse preference dataset (and models).
MIT License

The overall score is not matching with the principles #11

Open ASC-Competition opened 8 months ago

ASC-Competition commented 8 months ago

Hi, I found that some answers with a higher overall_score have a lower helpfulness score in the evol_instruct.jsonl dataset, whose principle is 100% helpfulness.

For example, the scores of the 9th sample in the evol_instruct.jsonl dataset are as follows:

| models | helpfulness | honesty | instruction following | truthfulness | overall score |
|---|---|---|---|---|---|
| gpt-3.5-turbo | 4 | 5 | 4 | 5 | 7 |
| llama-2-70b-chat | 4 | 4 | 5 | 5 | 7.5 |
| mpt-30b-chat | 3 | 4 | 3 | 5 | 6.5 |
| vicuna-33b | 5 | 4 | 4 | 5 | 6.5 |

The answer of vicuna-33b has the highest helpfulness but lowest overall score.
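For reference, the mean of the four principles mentioned below can be computed directly from the scores in the table. This is just an illustrative sketch using the numbers above, not the actual dataset schema:

```python
# Per-model fine-grained scores from the 9th sample above
# (hard-coded for illustration; real records have a different layout).
scores = {
    "gpt-3.5-turbo":    {"helpfulness": 4, "honesty": 5, "instruction_following": 4, "truthfulness": 5},
    "llama-2-70b-chat": {"helpfulness": 4, "honesty": 4, "instruction_following": 5, "truthfulness": 5},
    "mpt-30b-chat":     {"helpfulness": 3, "honesty": 4, "instruction_following": 3, "truthfulness": 5},
    "vicuna-33b":       {"helpfulness": 5, "honesty": 4, "instruction_following": 4, "truthfulness": 5},
}

# Mean of the four fine-grained principles per model.
means = {model: sum(s.values()) / len(s) for model, s in scores.items()}
for model, mean in sorted(means.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {mean}")
```

Note that by this measure three of the four models tie at 4.5, so the mean alone does not reproduce the overall-score ranking either.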

My question is: should I pick the answer with the highest overall score or the highest helpfulness score as the preferred answer, or should I use the mean of the four principles?

Any suggestions would be appreciated, thanks.

lifan-yuan commented 7 months ago

Hi,

Thanks for your interest.

The overall and fine-grained scores are annotated under different schemas and thus may not strictly match each other. Specifically, fine-grained scores are annotated according to our hand-written documentation, while overall scores rely entirely on GPT-4 itself, with the textual critique serving as the CoT rationale for scoring.

We investigated the effects of both kinds of scores in our paper (see Section 4.1) and found that using fine-grained scores was slightly better. But note that those experiments were based on the previous "bugged" version of the overall scores (see this issue), and we are not sure whether the conclusion in the paper still applies to our updated scores.
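If it helps, here is a minimal sketch of building a (chosen, rejected) preference pair from a single record using the mean of the four fine-grained scores. The field names (`completions`, `annotations`, `Rating`, `response`) are assumptions about the record layout for illustration, not the exact dataset schema:

```python
# Sketch: select chosen/rejected completions by mean fine-grained score.
# Field names below are hypothetical, not the exact UltraFeedback schema.
ASPECTS = ["helpfulness", "honesty", "instruction_following", "truthfulness"]

def fine_grained_mean(completion):
    # Average the four per-aspect ratings attached to one completion.
    return sum(int(completion["annotations"][a]["Rating"]) for a in ASPECTS) / len(ASPECTS)

def to_preference_pair(record):
    # Rank completions by mean fine-grained score: the top one becomes
    # "chosen", the bottom one "rejected".
    ranked = sorted(record["completions"], key=fine_grained_mean, reverse=True)
    return ranked[0]["response"], ranked[-1]["response"]

# Toy record with two completions (ratings are made up for the example).
record = {
    "completions": [
        {"response": "answer A",
         "annotations": {a: {"Rating": r} for a, r in zip(ASPECTS, [5, 4, 4, 5])}},
        {"response": "answer B",
         "annotations": {a: {"Rating": r} for a, r in zip(ASPECTS, [3, 4, 3, 5])}},
    ]
}
chosen, rejected = to_preference_pair(record)
```

Ties are possible with only four integer ratings, so in practice you may want a tie-breaker (e.g. the overall score or random choice).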

Hope this helps.