IBM / unitxt

🦄 Unitxt: a python library for getting data fired up and set for training and evaluation
https://unitxt.rtfd.io
Apache License 2.0

Add example of using LLM as a judge for summarization dataset. #965

Closed: eladven closed this 5 days ago

yoavkatz commented 5 days ago

Overall looks good. But do we really want users to write a template for each task being evaluated? I think a better model is to guide them toward something like: "card=cards.xsum,metrics=[metrics.llm_as_judge]"

Yes, I agree. We already have an example of adding an LLM metric definition and using it. Here, I think we should use a simple predefined metric. We should show one metric that uses the reference answer and one that does not.
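To make the suggestion concrete, here is a minimal sketch of composing the kind of recipe query quoted above, so the user only names a card and a predefined metric rather than writing a template. The helper `build_recipe_query` is hypothetical and not part of unitxt. `metrics.llm_as_judge` is the placeholder name from the comment, and the two variant names are invented for illustration; the real catalog entries may differ.

```python
# Hypothetical helper (not a unitxt API): composes a comma-separated
# recipe query of the form "card=<card>,metrics=[<metric>]".
def build_recipe_query(card: str, metrics: list[str]) -> str:
    metrics_arg = ",".join(metrics)
    return f"card={card},metrics=[{metrics_arg}]"


# The placeholder metric name is taken verbatim from the review comment.
query = build_recipe_query("cards.xsum", ["metrics.llm_as_judge"])
print(query)  # card=cards.xsum,metrics=[metrics.llm_as_judge]

# Invented names for the two variants discussed in the thread: one judge
# metric that sees the reference summary and one that does not.
with_reference = build_recipe_query(
    "cards.xsum", ["metrics.llm_as_judge.with_reference"]
)
reference_free = build_recipe_query(
    "cards.xsum", ["metrics.llm_as_judge.reference_free"]
)
```

The resulting string would then be passed to whatever loading entry point the example script uses; that step is left out here because it depends on the installed unitxt version.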

codecov[bot] commented 5 days ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 91.35%. Comparing base (31f7d4b) to head (7445aa5). Report is 8 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #965      +/-   ##
==========================================
+ Coverage   91.33%   91.35%   +0.01%
==========================================
  Files         110      112       +2
  Lines       11704    11794      +90
==========================================
+ Hits        10690    10774      +84
- Misses       1014     1020       +6
```
