YJiangcm / FollowBench

Code for "FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models (ACL 2024)"
https://arxiv.org/abs/2310.20410
Apache License 2.0

some questions #3

Closed AccidM closed 4 months ago

AccidM commented 4 months ago

Thank you for proposing this interesting benchmark.

After finishing the Model Inference and LLM-based Evaluation steps, we tried to obtain the results as described in Merge Evaluation and Save Results. However, several problems occurred:

  1. There are a lot of `ERROR:gpt4_based_evaluation` or `You must manually fix the evaluation` messages. Is this caused by a lack of robustness in your prompt, and how can we resolve it? Do we really have to do the judgment manually, given that there are dozens of errors?
  2. What does `Content: error` mean and how can we fix it? It seems different from the problem above.
  3. Why can the Satisfaction Rate values be negative?
YJiangcm commented 4 months ago

Thanks for your interest in our work!

For your first and second questions, the errors you are encountering are due to parsing failures in the function `paring_discriminative_generation(generation, level)` in code/gpt4_based_evaluation.py. This function parses the evaluator's response and outputs the satisfaction rate values. Our original experiments used the "gpt-4-0613" version for evaluation, and the function may have trouble processing the different output formatting of other GPT-4 versions, which leads to these errors. We have since modified the function to make it more robust and handle various formats more effectively. Please try the updated code, which should resolve these parsing errors.

For your third question, the satisfaction rate values can appear negative because `paring_discriminative_generation(generation, level)` is designed to return -1 when an exception occurs. In other words, if the evaluator's response cannot be parsed, the function defaults to a satisfaction rate of -1 to flag the issue.
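
To make the fallback concrete, here is a minimal sketch of such a defensive parser. The function name `parse_satisfaction`, the "Level i: YES/NO" response format, and the returned fraction are assumptions for illustration only; the actual `paring_discriminative_generation` in code/gpt4_based_evaluation.py may extract the values differently.

```python
import re


def parse_satisfaction(generation: str, level: int) -> float:
    """Hypothetical sketch of a defensive parser for an evaluator response.

    Assumes the evaluator states a per-level judgment such as "Level 2: YES"
    (this format is an assumption, not the repository's actual one).
    Returns the fraction of levels judged satisfied, or -1 if parsing fails,
    mirroring the fallback behavior described above.
    """
    try:
        satisfied = 0
        for i in range(1, level + 1):
            # Tolerate minor formatting variations, e.g. "level 3 - no".
            match = re.search(rf"level\s*{i}\s*[:\-]?\s*(yes|no)",
                              generation, re.IGNORECASE)
            if match is None:
                raise ValueError(f"no judgment found for level {i}")
            if match.group(1).lower() == "yes":
                satisfied += 1
        return satisfied / level
    except Exception:
        # Signal a parsing failure instead of crashing the merge step.
        return -1
```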