I feel that the current prompt/answer rating system can be too subjective and unclear at times, which may be hurting RLHF quality. So, based on my experience so far, here is my proposal for a modified rating system (comments in parentheses):
Required tag: Is this prompt/answer spam? "Spam" is defined as:
Irrelevant to the current conversation tree.
Wrong language.
Answers included in the prompt itself.
Extremely low effort.
(Basically, content that should be removed but that, unlike the rule violations in section 2, won't generate useful safety training data.)
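To make this concrete, here is a minimal sketch of how the required tag might be stored, assuming a Python data model; SpamReason and SpamTag are hypothetical names, not an existing schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class SpamReason(str, Enum):
    IRRELEVANT_TO_TREE = "irrelevant_to_tree"
    WRONG_LANGUAGE = "wrong_language"
    ANSWER_IN_PROMPT = "answer_in_prompt"
    EXTREMELY_LOW_EFFORT = "extremely_low_effort"

@dataclass
class SpamTag:
    is_spam: bool                        # required on every prompt/answer
    reason: Optional[SpamReason] = None  # set only when is_spam is True
```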
Optional tags: Does this need to be removed for a rule violation?
Contains PII/doxxing info.
Encouraging violence or self-harm. (This should be a clear yes/no for removal purposes; I don't think a 1-5 scale is needed.)
Pornographic content. (Including CSAM; plain "sexual content" might be too broad a label.)
Discriminatory content/extreme rudeness. (Also a clear yes/no for removal purposes; no 1-5 scale needed.)
Unedited text from other chatbots. (Expanding on #2847: removal should probably be decided case by case, because sometimes those answers can be good, but labeling them anyway makes it easier to remove the "as an AI language model" boilerplate. The "naturalness" aspect overlaps with quality, so there is no need to score it from 1-5 when a quality score already exists.)
(I think it would be good to replace the current free-text report message with the tags in sections 1 and 2, to make it unambiguous what is and isn't allowed in the dataset and to simplify report reviews.)
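Along the same lines, the removal tags could be plain yes/no booleans rather than scores; again a hypothetical Python sketch, not an existing schema:

```python
from dataclasses import dataclass

@dataclass
class RemovalTags:
    # Optional yes/no flags; any flag set to True queues the message
    # for a removal review rather than a 1-5 score.
    pii_doxxing: bool = False
    violence_or_self_harm: bool = False
    pornographic: bool = False           # includes CSAM
    discriminatory_or_rude: bool = False
    unedited_chatbot_text: bool = False  # removal decided case by case (#2847)
```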
Optional tags: Did the content break a guideline that doesn't warrant removal, but should still be labeled for future review/fixes?
Potentially controversial. (I expect this to be mostly current events/politics.)
Factually inaccurate. (Ideally with the ability to propose edits to these in the future.)
Typos and Markdown errors. (Same: ideally with the ability to propose edits in the future.)
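These could live in a third, lower-severity group of flags; hypothetical sketch as above:

```python
from dataclasses import dataclass

@dataclass
class GuidelineTags:
    # Labels for later review/fixes; these never trigger removal on their own.
    potentially_controversial: bool = False
    factually_inaccurate: bool = False
    typos_or_markdown_errors: bool = False
```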
Required: Rate from 1-5. (Simplified from 6 metrics to 3, with unambiguous rating guidelines so the scores are actually useful for RLHF training; these 3 metrics would work for rating generated chat answers as well. A sketch of the schema follows the metric definitions below.)
a. Quality:
For prompts: How effectively can one answer the prompt, given how clearly it conveys its intent? 1 means it is completely unclear what the prompt wants; 5 means it is very clear.
For answers: If you were the prompter, how satisfied would you be with this answer? 1 means not at all, 3 means about as expected, 5 means greatly exceeding expectations.
b. Bias:
For prompts: To what extent does this prompt exhibit bias? 1 means a blatantly loaded question; 5 means a fair question to ask. (Included for consistency; I am not sure it is needed.)
For answers: Is the answer a fair response to the prompt? 1 means very one-sided; 5 means a fair response. (I deliberately avoided the word "neutral": taking the exact center on every issue is not ideal in many cases, so "fairness" is the better metric.)
c. Effort/Difficulty:
For prompts: How much effort do you think is needed to manually write a good answer to this prompt? 1 means a simple search would do; 5 means it requires either deep field-specific knowledge or a very long time to research and write.
For answers: How much effort do you think this answer took if it was written manually? 1 means a simple Google search; 5 means it requires either deep field-specific knowledge or looks like it took a very long time to research and write.
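Putting the three metrics together, a minimal sketch with range validation might look like this (hypothetical names, assuming a Python data model):

```python
from dataclasses import dataclass

@dataclass
class Ratings:
    quality: int            # 1 = unclear/unsatisfying, 5 = very clear/exceeds expectations
    bias: int               # 1 = loaded/one-sided, 5 = fair
    effort_difficulty: int  # 1 = a simple search, 5 = deep expertise or long research

    def __post_init__(self) -> None:
        # Reject anything outside the 1-5 scale at submission time.
        for name in ("quality", "bias", "effort_difficulty"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")
```

A rating would then be submitted as, e.g., Ratings(quality=4, bias=5, effort_difficulty=2), with out-of-range values rejected up front.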
Optional tags: Emoji reactions
Expanding on the thumbs up/thumbs down system: tag the tone of the answer with additional emoji reactions like funny, sad, angry, happy, love, etc. (I remember Microsoft had a project that uses ML to tag text with emoji, or something along those lines, so this could be a more flexible approach than only rating how humorous/sarcastic an answer is from 1-5.)
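One way to keep the reactions flexible but still aggregatable is an allow-listed counter; the emoji set and names below are placeholders, not a proposal for the exact list:

```python
from dataclasses import dataclass, field

ALLOWED_REACTIONS = {"👍", "👎", "😂", "😢", "😠", "😊", "❤️"}

@dataclass
class Reactions:
    # Per-answer reaction counts, restricted to an allow-list so the
    # aggregates stay comparable across answers.
    counts: dict[str, int] = field(default_factory=dict)

    def add(self, emoji: str) -> None:
        if emoji not in ALLOWED_REACTIONS:
            raise ValueError(f"unsupported reaction: {emoji}")
        self.counts[emoji] = self.counts.get(emoji, 0) + 1
```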