From `utils/utils.py` (excerpted; `[...]` marks elided code):

```python
def get_chosen_reject(example, target_lang):
    sys1_score_key = f"gpt4_{target_lang}_{data_args.cpo_scorer}"
    sys2_score_key = f"alma_{target_lang}_{data_args.cpo_scorer}"
    ref_score_key = f"ref_{target_lang}_{data_args.cpo_scorer}"
    sys1_output_key = f"gpt4_{target_lang}"
    sys2_output_key = f"alma_{target_lang}"
    ref_output_key = target_lang
    [...]
    # Human eval
    if "Delta" in example and example["Delta"] != 0:
        if example["Delta"] > 0:
            return example[sys1_output_key], example[sys2_output_key]
        else:
            return example[sys2_output_key], example[sys1_output_key]
    [...]
    return highest_score_sentence, lowest_score_sentence
```
For the scenario where Delta > 0, this code returns the output from `gpt4_{target_lang}` as the chosen (higher-scored) sentence and the output from ALMA as the rejected (lower-scored) sentence. However, the dataset card for haoranxu/ALMA-R-Preference states:
> **Others**
> - Delta: A value of 0 indicates non-human annotated data or tied evaluations. A positive number suggests that alma_de is better than gpt4_de, and vice versa.
Hey, thanks for pointing out the discrepancy! The data description in the README was wrong, and I have corrected it. Thanks again for your careful checking!
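For clarity, below is a minimal, self-contained sketch of the selection logic under the corrected convention (Delta > 0 means the GPT-4 translation was preferred by human annotators, matching the code above). The `SimpleNamespace` stand-in for `data_args`, the `"kiwi"` scorer name, and the toy record are illustrative assumptions, and the score-based fallback is a simplified guess at the elided branch, not the repository's actual implementation:

```python
from types import SimpleNamespace

data_args = SimpleNamespace(cpo_scorer="kiwi")  # hypothetical stand-in

def get_chosen_reject(example, target_lang):
    sys1_score_key = f"gpt4_{target_lang}_{data_args.cpo_scorer}"
    sys2_score_key = f"alma_{target_lang}_{data_args.cpo_scorer}"
    ref_score_key = f"ref_{target_lang}_{data_args.cpo_scorer}"
    sys1_output_key = f"gpt4_{target_lang}"
    sys2_output_key = f"alma_{target_lang}"
    ref_output_key = target_lang

    # Human eval: a non-zero Delta overrides the automatic scores.
    # Corrected convention: Delta > 0 means the GPT-4 output was
    # preferred, so it becomes the chosen sentence.
    if "Delta" in example and example["Delta"] != 0:
        if example["Delta"] > 0:
            return example[sys1_output_key], example[sys2_output_key]
        else:
            return example[sys2_output_key], example[sys1_output_key]

    # Simplified stand-in for the elided logic: rank the three
    # candidates (GPT-4, ALMA, reference) by their automatic scores.
    candidates = [
        (example[sys1_score_key], example[sys1_output_key]),
        (example[sys2_score_key], example[sys2_output_key]),
        (example[ref_score_key], example[ref_output_key]),
    ]
    highest_score_sentence = max(candidates, key=lambda c: c[0])[1]
    lowest_score_sentence = min(candidates, key=lambda c: c[0])[1]
    return highest_score_sentence, lowest_score_sentence

# Toy record (hypothetical values): annotators preferred the GPT-4 output.
example = {
    "Delta": 2,
    "gpt4_de": "GPT-4 translation",
    "alma_de": "ALMA translation",
    "de": "reference translation",
    "gpt4_de_kiwi": 0.85,
    "alma_de_kiwi": 0.80,
    "ref_de_kiwi": 0.82,
}
chosen, rejected = get_chosen_reject(example, "de")
print(chosen, "|", rejected)  # GPT-4 translation | ALMA translation
```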