Open penguine-ip opened 8 months ago
+1. A paved path for generating metrics/scores covered by Open LLM Leaderboard and LLM Safety Leaderboard (and maybe a few others) can be a great bonus for the framework
@rohinish404 We're trying to support multiple popular benchmarks for users to have instant access to through deepeval. TruthfulQA is a popular and also extremely easy to integrate: https://github.com/sylinrl/TruthfulQA/blob/main/data/v0/TruthfulQA.csv
If users can just have access to truthfulQA like this:
from deepeval.benchmarks import TruthfulQA
truthful_qa = TruthfulQA()
truthful_qa.evaluate(...)
In .benchmarks:
from deepeval.dataset import EvaluationDataset
class TruthfulQA(EvaluationDataset):
pass
Does this make sense to you?
@rohinish404 We're trying to support multiple popular benchmarks for users to have instant access to through deepeval. TruthfulQA is a popular and also extremely easy to integrate: https://github.com/sylinrl/TruthfulQA/blob/main/data/v0/TruthfulQA.csv
If users can just have access to truthfulQA like this:
from deepeval.benchmarks import TruthfulQA truthful_qa = TruthfulQA() truthful_qa.evaluate(...)
In .benchmarks:
from deepeval.dataset import EvaluationDataset class TruthfulQA(EvaluationDataset): pass
Does this make sense to you?
Yup let me have a look at the codebase and then will ask further in discord.
Currently, users of
deepeval
can only create their own evaluation dataset/test cases. To support more users fine-tuning their model,deepeval
should be able to import standard benchmarks such as MMLU, hellaswag, TruthfulQA, Big Bench, etc.Comment or DM on discord to discuss more: https://discord.com/invite/a3K9c8GRGt