YJiangcm / FollowBench

Code for "FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models (ACL 2024)"
https://arxiv.org/abs/2310.20410
Apache License 2.0

The number of LLM-evaluated examples #11

Open kkk-an opened 1 month ago

kkk-an commented 1 month ago

I just ran the code below and found that the number of examples that need to be evaluated by an LLM does not match your paper.

```python
import json

# Sources whose constraints are checked with rule-based evaluation;
# everything else is evaluated by an LLM.
rule_based_source = ["E2E", "WIKIEVENTS", "CONLL2003", "text_editing", "cnn_dailymail",
                     "xsum", "samsum", "gigaword", "arxiv", "BBH_logical", "BBH_time",
                     "self_made_space", "gsm_8k"]

for type in ["content", "situation", "format", "example", "mixed"]:
    data = json.load(open(f"./data/{type}_constraints.json"))
    rule, llm = 0, 0
    for d in data:
        level = d["level"]
        if level == 0:
            continue  # skip the unconstrained (level-0) instructions
        source = d["source"]
        if source in rule_based_source:
            rule += 1
        else:
            llm += 1
    print(f"type: {type}, rule: {rule}, llm: {llm}")
```

[screenshot of the printed rule/llm counts per constraint type]

Is there something I am misunderstanding, or is there a mismatch between your paper and code?

Thanks for your reply.

kkk-an commented 1 month ago

I have also checked my gpt4_discriminative_eval_input files and found the following numbers of examples that need LLM-based evaluation (counted as in the sketch below):

- mine: content: 65 | mixed: 45 | format: 140 | situation: 70
- paper: content: 50 | mixed: 10 | format: 120 | situation: 55
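For reference, this is roughly how I counted the eval-input examples. It is a minimal sketch, not the exact script: the directory name `gpt4_discriminative_eval_input` and the assumption that each file is JSON Lines (one example per line) are my own; adjust the path and parsing to match however your pipeline actually writes these files.

```python
import glob
import json
import os

# Assumed location of the GPT-4 discriminative eval inputs (adjust as needed).
eval_input_dir = "gpt4_discriminative_eval_input"

counts = {}
for path in glob.glob(os.path.join(eval_input_dir, "*")):
    # Assumption: one JSON object per non-empty line, each being one example
    # that requires LLM-based evaluation.
    with open(path) as f:
        counts[os.path.basename(path)] = sum(1 for line in f if line.strip())

for name, n in sorted(counts.items()):
    print(f"{name}: {n}")
```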

I am quite confused and would appreciate your help. Thank you so much.