YJiangcm / FollowBench

Code for "FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models (ACL 2024)"
https://arxiv.org/abs/2310.20410
Apache License 2.0

some question #6

Open yuanzhiyong1999 opened 2 months ago

yuanzhiyong1999 commented 2 months ago

Hello, I have a question: after running model_inference.py and getting the results, do I need to use my own model to run inference on all the questions before executing llm_eval.py? And what should the result look like once inference is complete? I ask because I saw parameters such as gpt4_discriminative_eval_input_path in llm_eval.py and I don't understand how this part works. Looking forward to your reply. @YJiangcm

yuanzhiyong1999 commented 2 months ago

Why doesn't the number of items under 'data' match the number in 'api_input'? (see the attached screenshot) @YJiangcm

lzzzx666 commented 2 months ago

llm_eval.py evaluates all of the results except the example constraints; example constraints are checked by rules rather than by an LLM. After running llm_eval.py, you should run eval.py to get the final result.
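
For intuition, here is a minimal sketch of that split, with hypothetical file and field names (the repo's actual schema may differ): example constraints stay on the rule-based path, and everything else becomes GPT-4 evaluation input, which is also why the item count in 'api_input' can be smaller than in 'data'.

```python
# Minimal sketch of the split described above; the file name and the
# "constraint_type" key are assumptions, not the repo's actual schema.
import json

def needs_llm_eval(item):
    # Example constraints are checked by rules; all other constraint types go to GPT-4.
    return item["constraint_type"] != "example"

with open("model_inference_output.json") as f:   # hypothetical output of model_inference.py
    results = json.load(f)

llm_eval_items = [r for r in results if needs_llm_eval(r)]       # would feed llm_eval.py
rule_eval_items = [r for r in results if not needs_llm_eval(r)]  # checked by rules in eval.py

print(f"{len(llm_eval_items)} items for GPT-4 evaluation, "
      f"{len(rule_eval_items)} items for rule-based checks")
```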

zhejunliux commented 1 month ago

model_inference.py runs the model under evaluation. llm_eval.py takes that model's outputs, processes them, and feeds them to GPT-4. eval: the rules produce a score from the model-under-evaluation outputs, the GPT-4 side also produces a score, and the two are merged at the end to compute HSR, SSR, and CSL.
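
For reference, a minimal sketch of how HSR and SSR can be aggregated from per-constraint satisfaction flags (HSR counts an instruction only when every constraint is satisfied; SSR averages the fraction of satisfied constraints); the data layout here is hypothetical and the exact merging in eval.py may differ.

```python
# Toy HSR/SSR aggregation over hypothetical satisfaction flags
# (one list of booleans per instruction, one flag per constraint).
satisfied = [
    [True, True, True],    # all constraints met  -> counts toward HSR
    [True, False, True],   # partially met        -> only contributes to SSR
    [False, False],        # none met
]

hsr = sum(all(flags) for flags in satisfied) / len(satisfied)               # hard satisfaction rate
ssr = sum(sum(flags) / len(flags) for flags in satisfied) / len(satisfied)  # soft satisfaction rate
print(f"HSR = {hsr:.2f}, SSR = {ssr:.2f}")   # HSR = 0.33, SSR = 0.56
```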

The code is extremely repetitive 😂, it's a real slog to read.

bittersweet1999 commented 1 week ago

> model_inference.py runs the model under evaluation. llm_eval.py takes that model's outputs, processes them, and feeds them to GPT-4. eval: the rules produce a score from the model-under-evaluation outputs, the GPT-4 side also produces a score, and the two are merged at the end to compute HSR, SSR, and CSL.
>
> The code is extremely repetitive 😂, it's a real slog to read.

I'm refactoring this code. I'd like to ask: the questions for the rule-based evaluation and the LLM-based evaluation should be different, right? So they could be split into two datasets and run separately? The current code has so many if branches that I'm completely lost and can't follow what it's doing.

zhejunliux commented 6 days ago

- Evaluation methods: rule-based (rules) + LLM-based (the evaluator model, GPT).
- Data subsets: the six categories it provides: content, example, mixed, etc.
- Metrics: HSR, SSR, CSL.

The example subset is scored on the (HSR, CSL) metrics; as a special case, it runs directly on the data from the model under evaluation (def evaluate_example_constraint and def csl_evaluation). The other five subsets call GPT with an assembled prompt to produce the evaluation results (def discriminative_evaluation and def rule_evaluation); after those two run, HSR and SSR read different positions of the result array: discriminative_result[0] --> HSR; discriminative_result[1] --> SSR. CSL: def csl_evaluation.
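
To make the index convention concrete, a small sketch with placeholder numbers; how eval.py actually merges the rule-based and GPT-based scores is not reproduced here, so the averaging below is purely illustrative.

```python
# Sketch of the index convention described above (the values are placeholders,
# and the averaging is illustrative, not the repo's actual merging logic).
discriminative_result = (0.62, 0.78)   # (hard score, soft score) from the GPT-based judge
rule_result = (0.70, 0.81)             # (hard score, soft score) from the rule-based checks

hsr = (discriminative_result[0] + rule_result[0]) / 2   # index 0 feeds HSR
ssr = (discriminative_result[1] + rule_result[1]) / 2   # index 1 feeds SSR
# CSL is computed separately by csl_evaluation (not shown here).
print(f"HSR = {hsr:.2f}, SSR = {ssr:.2f}")
```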

The rule-based part scores the data from the model under evaluation (e.g., Llama 3) directly; the LLM-based part uses GPT to judge that model's outputs, which is usually called labeling, and the LLM-based step itself does not aggregate the results.
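
As an illustration of that labeling step, a hedged sketch of an LLM judge that only marks whether a single constraint is satisfied and leaves all aggregation to a later step; the prompt wording, model name, and response parsing here are assumptions, not the repo's actual implementation.

```python
# Hypothetical LLM-as-judge labeling step: returns one boolean per constraint,
# without aggregating anything (aggregation would happen later, e.g. in eval.py).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(instruction: str, constraint: str, model_output: str) -> bool:
    prompt = (
        f"Instruction: {instruction}\n"
        f"Constraint: {constraint}\n"
        f"Model output: {model_output}\n"
        "Does the output satisfy the constraint? Answer YES or NO."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```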