YJiangcm / FollowBench

Code for "FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models (ACL 2024)"
https://arxiv.org/abs/2310.20410
Apache License 2.0

some question #6

Open yuanzhiyong1999 opened 2 months ago

yuanzhiyong1999 commented 2 months ago

Hello, I have a question: after running model_inference.py and getting the results, do I need to use my own model to run inference on all the questions before executing llm_eval.py? And what should the result look like once inference is complete? I ask because I saw parameters such as gpt4_discriminative_eval_input_path in llm_eval.py and I don't understand how this part works. Looking forward to your reply. @YJiangcm

yuanzhiyong1999 commented 2 months ago

Why doesn't the number of items under 'data' match the number in 'api_input'? (see the attached screenshot) @YJiangcm

lzzzx666 commented 2 months ago

llm_eval.py evaluates all of the results except the example constraints; example constraints are checked by rules rather than by an LLM. After running llm_eval.py, you should run eval.py to get the final result.
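
For intuition, here is a minimal sketch of that split, with hypothetical file and field names (the repo's actual schema may differ): example constraints stay on the rule-based path, and everything else becomes GPT-4 evaluation input, which is also why the item count in 'api_input' can be smaller than in 'data'.

```python
# Minimal sketch of the split described above; the file name and the
# "constraint_type" key are assumptions, not the repo's actual schema.
import json

def needs_llm_eval(item):
    # Example constraints are checked by rules; all other constraint types go to GPT-4.
    return item["constraint_type"] != "example"

with open("model_inference_output.json") as f:   # hypothetical output of model_inference.py
    results = json.load(f)

llm_eval_items = [r for r in results if needs_llm_eval(r)]       # would feed llm_eval.py
rule_eval_items = [r for r in results if not needs_llm_eval(r)]  # checked by rules in eval.py

print(f"{len(llm_eval_items)} items for GPT-4 evaluation, "
      f"{len(rule_eval_items)} items for rule-based checks")
```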

zhejunliux commented 1 month ago

model_inference.py runs the model under evaluation. llm_eval.py takes that model's outputs, processes them, and feeds them to GPT-4. eval: the rules produce a score from the model-under-evaluation outputs, the GPT-4 side also produces a score, and the two are merged at the end to compute HSR, SSR, and CSL.
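
For reference, a minimal sketch of how HSR and SSR can be aggregated from per-constraint satisfaction flags (HSR counts an instruction only when every constraint is satisfied; SSR averages the fraction of satisfied constraints); the data layout here is hypothetical and the exact merging in eval.py may differ.

```python
# Toy HSR/SSR aggregation over hypothetical satisfaction flags
# (one list of booleans per instruction, one flag per constraint).
satisfied = [
    [True, True, True],    # all constraints met  -> counts toward HSR
    [True, False, True],   # partially met        -> only contributes to SSR
    [False, False],        # none met
]

hsr = sum(all(flags) for flags in satisfied) / len(satisfied)               # hard satisfaction rate
ssr = sum(sum(flags) / len(flags) for flags in satisfied) / len(satisfied)  # soft satisfaction rate
print(f"HSR = {hsr:.2f}, SSR = {ssr:.2f}")   # HSR = 0.33, SSR = 0.56
```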

The code is extremely repetitive 😂, it's a real slog to read.

bittersweet1999 commented 1 week ago

> model_inference.py runs the model under evaluation. llm_eval.py takes that model's outputs, processes them, and feeds them to GPT-4. eval: the rules produce a score from the model-under-evaluation outputs, the GPT-4 side also produces a score, and the two are merged at the end to compute HSR, SSR, and CSL.
>
> The code is extremely repetitive 😂, it's a real slog to read.

I'm refactoring this code. I'd like to ask: the questions for the rule-based evaluation and the LLM-based evaluation should be different, right? So they could be split into two datasets and run separately? The current code has so many if branches that I'm completely lost and can't follow what it's doing.

zhejunliux commented 6 days ago

- Evaluation methods: rule-based (rules) + LLM-based (the evaluator model, GPT).
- Data subsets: the six categories it provides: content, example, mixed, etc.
- Metrics: HSR, SSR, CSL.

The example subset is scored on the (HSR, CSL) metrics; as a special case, it runs directly on the data from the model under evaluation (def evaluate_example_constraint and def csl_evaluation). The other five subsets call GPT with an assembled prompt to produce the evaluation results (def discriminative_evaluation and def rule_evaluation); after those two run, HSR and SSR read different positions of the result array: discriminative_result[0] --> HSR; discriminative_result[1] --> SSR. CSL: def csl_evaluation.
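
To make the index convention concrete, a small sketch with placeholder numbers; how eval.py actually merges the rule-based and GPT-based scores is not reproduced here, so the averaging below is purely illustrative.

```python
# Sketch of the index convention described above (the values are placeholders,
# and the averaging is illustrative, not the repo's actual merging logic).
discriminative_result = (0.62, 0.78)   # (hard score, soft score) from the GPT-based judge
rule_result = (0.70, 0.81)             # (hard score, soft score) from the rule-based checks

hsr = (discriminative_result[0] + rule_result[0]) / 2   # index 0 feeds HSR
ssr = (discriminative_result[1] + rule_result[1]) / 2   # index 1 feeds SSR
# CSL is computed separately by csl_evaluation (not shown here).
print(f"HSR = {hsr:.2f}, SSR = {ssr:.2f}")
```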

The rule-based part scores the data from the model under evaluation (e.g., Llama 3) directly; the LLM-based part uses GPT to judge that model's outputs, which is usually called labeling, and the LLM-based step itself does not aggregate the results.
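
As an illustration of that labeling step, a hedged sketch of an LLM judge that only marks whether a single constraint is satisfied and leaves all aggregation to a later step; the prompt wording, model name, and response parsing here are assumptions, not the repo's actual implementation.

```python
# Hypothetical LLM-as-judge labeling step: returns one boolean per constraint,
# without aggregating anything (aggregation would happen later, e.g. in eval.py).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(instruction: str, constraint: str, model_output: str) -> bool:
    prompt = (
        f"Instruction: {instruction}\n"
        f"Constraint: {constraint}\n"
        f"Model output: {model_output}\n"
        "Does the output satisfy the constraint? Answer YES or NO."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```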