YJiangcm / FollowBench

Code for "FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models (ACL 2024)"
https://arxiv.org/abs/2310.20410
Apache License 2.0

The number of LLM-evaluated examples #11

Open kkk-an opened 1 month ago

kkk-an commented 1 month ago

I just ran the code below and found that the number of examples that need to be evaluated by an LLM does not match your paper.

```python
import json

# Sources whose constraints are checked with rule-based evaluation;
# everything else is evaluated by an LLM.
rule_based_source = ["E2E", "WIKIEVENTS", "CONLL2003", "text_editing", "cnn_dailymail",
                     "xsum", "samsum", "gigaword", "arxiv", "BBH_logical", "BBH_time",
                     "self_made_space", "gsm_8k"]

for type in ["content", "situation", "format", "example", "mixed"]:
    data = json.load(open(f"./data/{type}_constraints.json"))
    rule, llm = 0, 0
    for d in data:
        level = d["level"]
        if level == 0:
            continue  # skip the unconstrained (level-0) instructions
        source = d["source"]
        if source in rule_based_source:
            rule += 1
        else:
            llm += 1
    print(f"type: {type}, rule: {rule}, llm: {llm}")
```

[screenshot of the printed rule/llm counts per constraint type]

Is there something I am misunderstanding, or is there a mismatch between your paper and code?

Thanks for your reply.

kkk-an commented 1 month ago

I have also checked my gpt4_discriminative_eval_input files and found the following numbers of examples that need LLM-based evaluation (counted as in the sketch below):

- mine: content: 65 | mixed: 45 | format: 140 | situation: 70
- paper: content: 50 | mixed: 10 | format: 120 | situation: 55
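For reference, this is roughly how I counted the eval-input examples. It is a minimal sketch, not the exact script: the directory name `gpt4_discriminative_eval_input` and the assumption that each file is JSON Lines (one example per line) are my own; adjust the path and parsing to match however your pipeline actually writes these files.

```python
import glob
import json
import os

# Assumed location of the GPT-4 discriminative eval inputs (adjust as needed).
eval_input_dir = "gpt4_discriminative_eval_input"

counts = {}
for path in glob.glob(os.path.join(eval_input_dir, "*")):
    # Assumption: one JSON object per non-empty line, each being one example
    # that requires LLM-based evaluation.
    with open(path) as f:
        counts[os.path.basename(path)] = sum(1 for line in f if line.strip())

for name, n in sorted(counts.items()):
    print(f"{name}: {n}")
```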

I am quite confused and would appreciate your help. Thank you so much.