Open yuanzhiyong1999 opened 2 months ago
Why doesn't the number of items under 'data' match the number in 'api_input'? @YJiangcm
llm_eval.py examines all results except the example constraints; example constraints are checked by rules rather than by the LLM. After running llm_eval.py, you should run eval.py to get the final result.
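A minimal sketch of that split, assuming each item carries a constraint category; the function name, category labels, and item shape below are illustrative, not the repo's actual API:

```python
# Sketch: route items so that "example" constraints are rule-checked
# while all other categories are collected into the GPT-4 evaluation input.
# Category names and the dict layout are assumptions for illustration.

def split_for_evaluation(items):
    rule_checked, llm_batch = [], []
    for item in items:
        if item["category"] == "example":
            # example constraints are verified by deterministic rules
            rule_checked.append(item)
        else:
            # everything else is fed to the LLM judge (GPT-4)
            llm_batch.append(item)
    return rule_checked, llm_batch

items = [
    {"id": 0, "category": "example"},
    {"id": 1, "category": "content"},
    {"id": 2, "category": "mixed"},
]
rule_items, llm_items = split_for_evaluation(items)
print(len(rule_items), len(llm_items))  # → 1 2
```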
model_inference.py runs the model to be evaluated; llm_eval.py processes that model's outputs and feeds them to GPT-4. eval: the rules produce a score for the evaluated model's outputs, the GPT-4 side also produces a score, and the two are finally merged to compute HSR, SSR, and CSL.
The code is extremely repetitive 😂, it's exhausting to read.
I'm refactoring this code. A question: the rule-based evaluation and the LLM-based evaluation should use different questions, right? So the data could be split into two sets and run separately? Right now it's a tangle of if statements and I can't follow the logic at all.
Evaluation methods: rule-based (rules) + LLM-based (an evaluator model, GPT). Data subsets: the six categories provided: content, example, mix, etc. Metrics: HSR, SSR, CSL.
The example subset is scored on the (HSR, CSL) metrics; as a special case, it runs directly on the evaluated model's outputs: def evaluate_example_constraint and def csl_evaluation.
The other five categories call GPT to produce evaluation results, with assembled prompts: def discriminative_evaluation and def rule_evaluation.
Once those two finish, a result comes out; HSR and SSR take different values from the array: discriminative_result[0] → HSR; discriminative_result[1] → SSR. CSL: def csl_evaluation.
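A sketch of how per-instance judgments could be merged into HSR and SSR. In FollowBench, HSR counts an instance as correct only if all of its constraints are satisfied, while SSR averages the per-constraint satisfaction rate; the boolean-list data layout below is an assumption, not the repo's actual format:

```python
# Sketch: compute HSR (hard satisfaction) and SSR (soft satisfaction)
# from per-instance constraint judgments. Each inner list holds one
# boolean per constraint of that instance (layout is an assumption).

def hsr_ssr(judgments):
    # HSR: fraction of instances with *all* constraints satisfied
    hsr = sum(all(js) for js in judgments) / len(judgments)
    # SSR: mean over instances of the per-constraint satisfaction rate
    ssr = sum(sum(js) / len(js) for js in judgments) / len(judgments)
    return hsr, ssr

j = [[True, True], [True, False], [False, False]]
h, s = hsr_ssr(j)
print(round(h, 3), round(s, 3))  # → 0.333 0.5
```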
Rule-based scores the outputs of the model under evaluation (Llama 3 etc.); LLM-based uses GPT to judge Llama 3's outputs, which is usually called labeling, and the LLM-based part does not itself aggregate the results.
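For the CSL metric mentioned above, my understanding (hedged; the exact definition should be checked against the paper) is that for one instruction with results at consecutive difficulty levels, CSL is the highest level n such that every level from 1 through n is hard-satisfied. A sketch under that assumption:

```python
# Sketch of CSL: count how many consecutive difficulty levels, starting
# from level 1, the model hard-satisfies. The boolean-list input is an
# assumed representation, one entry per level in increasing difficulty.

def csl(level_results):
    n = 0
    for ok in level_results:
        if not ok:
            break  # stop at the first failed level
        n += 1
    return n

print(csl([True, True, False, True, False]))  # → 2
```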
Hello, I have a question: after executing model_inference.py and getting the results, do I need to use my own model to infer all the questions before executing llm_eval.py? And what should the result look like once inference is complete? I saw parameters such as gpt4_discriminative_eval_input_path in llm_eval.py and don't understand how this works. Looking forward to your reply. @YJiangcm