confident-ai / deepeval

The LLM Evaluation Framework
https://docs.confident-ai.com/
Apache License 2.0

Integration of LM Evaluation Harness. #332

Open · Anindyadeep opened this issue 9 months ago

Anindyadeep commented 9 months ago

LM Evaluation Harness is one of the most general evaluation frameworks, covering hundreds of tasks and benchmarks across many different metrics.

General evaluation of LLMs on broad tasks matters a great deal, both during research and for pre-production checks. Evaluating LLMs on tasks like toxicity or summarization is especially important.

Since the Harness is largely CLI based, integrating it into deepeval as a modular pipeline could be very helpful for running CI checks while fine-tuning LLMs or during pre-production steps; a rough sketch of driving the harness programmatically follows the list below.

Expected things:

  1. Terminal-based output
  2. DeepEval check
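
A minimal sketch of the programmatic call that such a pipeline would wrap, assuming lm-eval-harness exposes `evaluator.simple_evaluate` as in its Python API; the backend name, task names, and argument defaults vary between harness versions and are only illustrative here:

```python
# Sketch only: drives lm-eval-harness from Python instead of the CLI.
# Assumes lm-eval's evaluator.simple_evaluate API; names may differ by version.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                     # assumed HF causal-LM backend name
    model_args="pretrained=gpt2",          # any Hugging Face checkpoint
    tasks=["hellaswag", "truthfulqa_mc"],  # example benchmark tasks
    num_fewshot=0,
    limit=100,                             # subsample for quick CI runs
)

# results["results"] maps each task to its metric dict (e.g. accuracy).
for task, metrics in results["results"].items():
    print(task, metrics)
```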
penguine-ip commented 9 months ago

Sounds great, please add it in this module: https://github.com/confident-ai/deepeval/tree/main/deepeval/check. The entry point should be the check function in check.py, let me know if you have any questions!

P.S. You might want to put the terminal output logic inside the check function to avoid repetitive code.
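
A hypothetical sketch of what that `check` entry point could look like, with the terminal output folded in as suggested. The signature, the `thresholds` parameter, and the metric key are placeholders for illustration, not deepeval's actual API, and the harness call is assumed as above:

```python
# Hypothetical shape of the proposed entry point in deepeval/check/check.py.
# Thresholds, task names, metric keys, and the harness call are placeholders.
from typing import Dict

from lm_eval import evaluator  # assumed lm-eval-harness Python API


def check(model_args: str, thresholds: Dict[str, float]) -> bool:
    """Run harness tasks, print a terminal summary, and return pass/fail."""
    results = evaluator.simple_evaluate(
        model="hf-causal",
        model_args=model_args,
        tasks=list(thresholds.keys()),
    )["results"]

    passed = True
    for task, minimum in thresholds.items():
        score = results[task].get("acc", 0.0)  # metric key varies per task
        status = "PASS" if score >= minimum else "FAIL"
        passed &= score >= minimum
        print(f"{task:<20} acc={score:.3f} (>= {minimum:.3f})  {status}")
    return passed


if __name__ == "__main__":
    # Example: fail a CI run if hellaswag accuracy drops below 0.30.
    ok = check("pretrained=gpt2", {"hellaswag": 0.30})
    raise SystemExit(0 if ok else 1)
```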

Anindyadeep commented 9 months ago

> Sounds great, please add it in this module: https://github.com/confident-ai/deepeval/tree/main/deepeval/check. The entry point should be the check function in check.py, let me know if you have any questions!
>
> P.S. You might want to put the terminal output logic inside the check function to avoid repetitive code.

Sounds good.