lintangsutawika opened 4 months ago
A multi-step call would probably be easier to implement.
@lintangsutawika why not separate multi-step/multi-round from llm-as-a-judge? Usually multi-step requires the same model to generate on a prompt that includes its previous generation. These seem quite different, and such a complicated structure shouldn't be needed to implement them. Moreover, llm-as-a-judge seems to be a kind of metric, whereas multi-step tasks may require that the current prompt include all previous lm answers, which leads to calling the same lm multiple times to act as judge. Or did I get something wrong?
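The multi-step pattern described above (each round's prompt carrying all previous answers) might be sketched roughly like this; `generate` is a hypothetical stand-in for whatever LM interface the harness would expose, and `run_multi_step` is not an actual lm-eval function:

```python
def run_multi_step(generate, initial_prompt, num_steps):
    """Sketch of a multi-round task: the prompt for each round is the
    previous prompt plus the model's last answer, so all earlier
    generations remain in context."""
    prompt = initial_prompt
    answers = []
    for _ in range(num_steps):
        answer = generate(prompt)
        answers.append(answer)
        # Accumulate the new answer into the prompt for the next round.
        prompt = prompt + "\n" + answer
    return answers
```

Note this calls the *same* model repeatedly on a growing prompt, which is indeed different from scoring a finished generation with a separate judge model.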
Llm-as-a-judge can be considered a type of metric. The point is that the score typically produced by a heuristic is instead output by a language model, which may or may not be the same type/version as the model being evaluated. This isn't really related to the multi-step idea, but it could be beneficial for evaluating multi-step tasks.
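As a rough illustration of "the score is output by a language model instead of a heuristic," a judge-style metric could look like the following; `judge_generate`, `llm_judge_metric`, and the prompt template are all hypothetical, not part of lm-eval's API:

```python
import re

# Hypothetical judge prompt: ask the judge model for a 1-5 rating.
JUDGE_TEMPLATE = (
    "Rate the following answer to the question on a scale of 1 to 5.\n"
    "Question: {question}\nAnswer: {answer}\nRating:"
)

def llm_judge_metric(judge_generate, question, answer):
    """Score an answer by prompting a judge model (which may or may not
    be the model under evaluation) and parsing a 1-5 rating from its
    reply. Returns None if no rating can be parsed."""
    reply = judge_generate(
        JUDGE_TEMPLATE.format(question=question, answer=answer)
    )
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else None
```

The interesting design point is that the "metric" now depends on another model's behavior, so the judge's type/version becomes part of the evaluation setup.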
@lintangsutawika Is there currently any initiative for this feature? I would love to help
Yes, currently @baberabb is working on a PR for this. I think it'd be great if you can provide some feedback based on your experience with Prometheus.
@baberabb Hello Baber, nice to meet you! I'd love to collaborate with you on working on this! On which platform could I best communicate with you (e.g., slack, discord)?
@SeungoneKim @lintangsutawika @baberabb and I met to discuss LLM-as-a-Judge.
key takeaways:
How can we integrate the use of language models to evaluate language model generations?
Currently, lm-eval evaluates language model generations with conventional metrics such as accuracy, BLEU, etc. These have known shortcomings, frequently being a poor proxy for progress or performance. Recent methods use GPT-4 to evaluate other language models' generations, and open-source approaches such as Prometheus have also gained interest.
Basic Requirements:
Advanced/Optional Requirements: