confident-ai / deepeval

The LLM Evaluation Framework
https://docs.confident-ai.com/
Apache License 2.0

Support popular benchmarks #508

Open · penguine-ip opened this issue 8 months ago

penguine-ip commented 8 months ago

Currently, users of deepeval can only create their own evaluation datasets/test cases. To better support users fine-tuning their models, deepeval should be able to import standard benchmarks such as MMLU, HellaSwag, TruthfulQA, BIG-bench, etc.

Comment here or DM on Discord to discuss further: https://discord.com/invite/a3K9c8GRGt

Peilun-Li commented 8 months ago

+1. A paved path for generating the metrics/scores covered by the Open LLM Leaderboard and the LLM Safety Leaderboard (and maybe a few others) would be a great bonus for the framework.

penguine-ip commented 8 months ago

@rohinish404 We're trying to support multiple popular benchmarks so users have instant access to them through deepeval. TruthfulQA is popular and also extremely easy to integrate: https://github.com/sylinrl/TruthfulQA/blob/main/data/v0/TruthfulQA.csv

If users could access TruthfulQA like this:

from deepeval.benchmarks import TruthfulQA

truthful_qa = TruthfulQA()
truthful_qa.evaluate(...)

In .benchmarks:

from deepeval.dataset import EvaluationDataset

class TruthfulQA(EvaluationDataset):
    pass

Does this make sense to you?
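
For concreteness, here is a rough sketch of what loading the CSV into that subclass could look like. It assumes pandas for the download, a raw-CSV URL derived from the repo link above, and that LLMTestCase plus EvaluationDataset.add_test_case are the right way to register cases; treat all of those as assumptions rather than the final API:

from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
import pandas as pd

# Raw CSV behind the repo link above (URL assumed from the blob path).
TRUTHFULQA_CSV = "https://raw.githubusercontent.com/sylinrl/TruthfulQA/main/data/v0/TruthfulQA.csv"

class TruthfulQA(EvaluationDataset):
    def __init__(self):
        # Assumes EvaluationDataset can be constructed with no arguments.
        super().__init__()
        df = pd.read_csv(TRUTHFULQA_CSV)
        for _, row in df.iterrows():
            # One test case per benchmark question; "Question" and
            # "Best Answer" are assumed to be the column names in the CSV.
            # actual_output is left empty for the model under test to fill in.
            self.add_test_case(
                LLMTestCase(
                    input=row["Question"],
                    expected_output=row["Best Answer"],
                    actual_output="",
                )
            )

Usage would then match the earlier snippet (truthful_qa = TruthfulQA() followed by truthful_qa.evaluate(...)); what evaluate(...) should accept (a model callable, pre-generated outputs, etc.) is probably the main design question to settle on Discord.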

rohinish404 commented 8 months ago


Yup, let me have a look at the codebase and then I'll ask further questions on Discord.