Open Superyeahh opened 4 days ago
Are you using the openai models?
No. I suspect the prompts used by the evaluation metrics in the code are written in English and don't work well for other languages. I translated the answers to the questions into English, and now the problem is mostly solved.
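A minimal sketch of the workaround described above: translate the non-English fields into English before scoring, since the metric prompts are English. The `translate` helper here is a hypothetical stand-in (not part of Ragas); wire in whatever machine-translation client you actually use.

```python
# Sketch: translate answer/ground-truth fields to English before evaluation.
# `translate` is a hypothetical placeholder, not a Ragas or OpenAI API --
# replace its body with a call to a real MT service.

def translate(text: str, target: str = "en") -> str:
    """Placeholder translation hook; returns the input unchanged."""
    return text  # swap in a real machine-translation call here

def translate_row(row: dict) -> dict:
    """Return a copy of one evaluation row with its free-text fields
    passed through the translation hook."""
    return {
        **row,
        "answer": translate(row["answer"]),
        "ground_truth": translate(row["ground_truth"]),
    }
```

The translated rows can then be fed to the evaluation exactly as before; only the text the English prompts see has changed.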
[ ] I checked the documentation and related resources and couldn't find an answer to my question.
Your Question
(1) For cases where `answer`, `context`, and `ground_truth` are long (about 500 words each), how can the code be modified to make the assessment more effective? Currently some metrics, such as `context_recall` and `faithfulness`, come back empty (not 0.0).
(2) Is it normal for a large number of `answer_relevancy` values to be 0.0?
(3) Is it also normal for `context_precision` to take only the two values 0.9999999 and 0.0?
Thanks for your answer!
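For question (1), one common mitigation is to split very long context strings into shorter chunks before evaluation, so each metric prompt sees smaller passages. The sketch below is an assumption, not Ragas functionality: the field names (`question`, `answer`, `contexts`, `ground_truth`) follow the usual Ragas dataset schema, but the chunking helpers are hypothetical.

```python
# Hypothetical preprocessing sketch: break each long context string into
# word-bounded chunks before building the evaluation dataset. The helpers
# below are not part of Ragas; they only reshape the input rows.

def chunk_text(text: str, max_words: int = 150) -> list[str]:
    """Split a string into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def prepare_row(row: dict, max_words: int = 150) -> dict:
    """Return a copy of one row whose long contexts are split into
    several shorter context entries."""
    contexts: list[str] = []
    for ctx in row["contexts"]:
        contexts.extend(chunk_text(ctx, max_words))
    return {**row, "contexts": contexts}

row = {
    "question": "What is X?",
    "answer": "X is ...",
    "contexts": ["word " * 500],  # one ~500-word context string
    "ground_truth": "X is ...",
}
prepared = prepare_row(row)  # contexts is now several ~150-word chunks
# The prepared rows would then be assembled into a dataset and passed to
# ragas.evaluate(...) as usual.
```

Shorter passages also make empty metric results easier to debug, since a single oversized prompt is less likely to be the cause.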