confident-ai / deepeval

The LLM Evaluation Framework
https://docs.confident-ai.com/

g_eval metric implicitly falls back to top score instead of sum of weighted top probabilities #1029

Open mattany opened 5 days ago

mattany commented 5 days ago

https://github.com/confident-ai/deepeval/blob/e49714ddd0fe523d27ef4c8b2477eeaeb7128ef2/deepeval/metrics/g_eval/g_eval.py#L315

It seems that the `except` clause here can cause unexpected results: if computing the weighted sum of the top token probabilities fails, the code silently falls back to returning the raw top score rather than raising an error.
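Roughly, the pattern I'm referring to looks like this (a simplified sketch, not the actual deepeval source; names are illustrative). The G-Eval paper's score is the probability-weighted sum over the candidate scores, score = sum_i p(s_i) * s_i, but a bare `except` quietly substitutes the single top score:

```python
def evaluation_score(raw_score: float, logprobs: dict[int, float] | None) -> float:
    """Illustrative sketch of the fallback pattern, not the real implementation."""
    try:
        # Probability-weighted sum over candidate scores, as in the G-Eval paper.
        return sum(score * prob for score, prob in logprobs.items())
    except Exception:
        # If the logprobs are missing or malformed, the raw top score is
        # returned with no warning -- the silent fallback this issue flags.
        return raw_score
```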

In my opinion it would be good to give the user of the metric the ability to choose between:

  1. logprobs sum mode.
  2. top token mode (for cases where the evaluator model does not expose the logprobs vector, or where the user simply doesn't care about the quality guarantee in the paper, which applies only to the probability-sum method).

I understand that in some cases the evaluator model doesn't give access to the logprobs vector. In that case, if the user has selected logprobs sum mode, I would like them to see an error explaining that the probabilities are not available and suggesting top token mode instead.
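A rough sketch of what I have in mind (the `ScoreMode` enum and `mode` parameter are just illustrative, not an existing deepeval API):

```python
from enum import Enum

class ScoreMode(Enum):
    LOGPROBS_SUM = "logprobs_sum"  # probability-weighted sum, as in the G-Eval paper
    TOP_TOKEN = "top_token"        # raw score from the single top token

def evaluation_score(
    raw_score: float,
    logprobs: dict[int, float] | None,
    mode: ScoreMode = ScoreMode.LOGPROBS_SUM,
) -> float:
    """Hypothetical mode-aware scoring; names and defaults are assumptions."""
    if mode is ScoreMode.TOP_TOKEN:
        return raw_score
    if not logprobs:
        # In logprobs-sum mode, missing probabilities surface as an explicit
        # error instead of a silent fallback to the top score.
        raise ValueError(
            "Evaluator model did not expose token logprobs; "
            "use ScoreMode.TOP_TOKEN instead."
        )
    return sum(score * prob for score, prob in logprobs.items())
```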

If there is agreement amongst the core team, I would be happy to contribute this functionality.

penguine-ip commented 4 days ago

Hey @mattany that would be extremely helpful, would really appreciate a PR :)