OpenBMB / UltraFeedback

A large-scale, fine-grained, diverse preference dataset (and models).
MIT License

Issue with `overall_score` computation #8

Open · dvsrepo opened this issue 9 months ago

dvsrepo commented 9 months ago

Hi!

Congrats on this amazing project.

We've been exploring the data and identified an issue with responses that have a very high `overall_score`. The issue seems to be related to this line, which causes responses with a critique rating of 1 to become a 10. We noticed this by looking at the critique rationale, which was highly negative for many (~2K) examples with an `overall_score` of 10.

lifan-yuan commented 9 months ago

Hi!

Sorry for the late response, and thanks for pointing that out! Yes, it seems to be a bug: the `>` should be `>=`.

Intuitively, a true 10 score should correspond to high fine-grained scores while a mistaken 10 relates to low ones. We will check all the 2k samples immediately.
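For readers following along, here is a minimal sketch of the off-by-one. The function names and the surrounding clamp logic are hypothetical reconstructions, not the actual annotation code; only the `>` vs `>=` comparison comes from the discussion above:

```python
def overall_score_buggy(rating: float) -> float:
    # Hypothetical reconstruction: a strict `>` excludes a rating of
    # exactly 1, so it falls into the fallback branch below and comes
    # out as a perfect 10.
    if rating > 1:
        return min(rating, 10.0)
    return 10.0  # fallback intended only for unparseable ratings


def overall_score_fixed(rating: float) -> float:
    # Same mapping with `>` changed to `>=`, as suggested:
    # a rating of 1 now stays a 1.
    if rating >= 1:
        return min(rating, 10.0)
    return 10.0


print(overall_score_buggy(1.0))  # 10.0 -- the reported symptom
print(overall_score_fixed(1.0))  # 1.0
```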

dvsrepo commented 9 months ago

Here's a space we've been using to verify this:

https://argilla-ultrafeedback-curator.hf.space/dataset/39de1a2e-d905-46bd-b940-42e06b6e0c06/annotation-mode?_page=1&_status=discarded

(login with: owner/12345678)

The only issue is that there are some examples with `overall_score` 10 that are good (the majority are bad, though).

We've been working on curating this data programmatically and with Argilla, and we'd be super happy to contribute back.
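As an illustration of the kind of programmatic check involved (field names such as `annotations` and `Rating` are assumptions for this sketch, not necessarily the dataset's actual schema), a record could be flagged when its `overall_score` is a perfect 10 but its mean fine-grained rating is low:

```python
def flag_suspicious(example: dict, threshold: float = 3.0) -> bool:
    # Collect the fine-grained ratings attached to the example.
    ratings = [ann["Rating"] for ann in example.get("annotations", {}).values()]
    mean_rating = sum(ratings) / len(ratings) if ratings else 0.0
    # A perfect overall score paired with low fine-grained ratings is
    # the signature of the `>` bug discussed in this thread.
    return example["overall_score"] == 10 and mean_rating <= threshold


# Toy records with hypothetical aspect names:
bad = {"overall_score": 10,
       "annotations": {"helpfulness": {"Rating": 1}, "honesty": {"Rating": 2}}}
good = {"overall_score": 10,
        "annotations": {"helpfulness": {"Rating": 9}, "honesty": {"Rating": 10}}}
print(flag_suspicious(bad), flag_suspicious(good))  # True False
```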

Thanks for building an amazing project!

ehartford commented 8 months ago

Can't wait to see the updated dataset!

lifan-yuan commented 8 months ago

Just updated the dataset, please check!

ehartford commented 8 months ago

This one?

https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences

lifan-yuan commented 8 months ago

No, it's on our official page: https://huggingface.co/datasets/openbmb/UltraFeedback