-
I'm opening this issue to discuss what we think the "LLM task" framework should aim to be, and how we could incrementally get there.
## What we have today
Today, what we call the "task framewo…
-
### Background:
[Langsmith Evaluations](https://docs.smith.langchain.com/concepts/evaluation) are a way to evaluate the performance of Automatic Import.
The [evaluation framework](https://docs.smith…
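For orientation, here is a minimal sketch of what an evaluation run looks like with the LangSmith Python SDK; the dataset name, target function, and evaluator below are hypothetical placeholders, not the actual Automatic Import setup.

```python
# Minimal LangSmith evaluation sketch (hypothetical dataset/target/evaluator names).
from langsmith.evaluation import evaluate

def target(inputs: dict) -> dict:
    # Placeholder for the system under test, e.g. a pipeline call.
    return {"output": f"generated answer for {inputs['question']}"}

def exact_match(run, example) -> dict:
    # Toy evaluator: compare the run output to the reference answer.
    score = run.outputs["output"] == example.outputs["expected"]
    return {"key": "exact_match", "score": int(score)}

results = evaluate(
    target,
    data="my-eval-dataset",          # hypothetical LangSmith dataset name
    evaluators=[exact_match],
    experiment_prefix="eval-sketch",
)
```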
-
We currently leverage some LLM-based evaluation metrics from ragas: https://github.com/explodinggradients/ragas
namely, `llm_context_precision`, `llm_context_recall` and `llm_answer_relevance` in thi…
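For context, a minimal sketch of scoring a toy sample with the corresponding ragas metrics (the imports below follow the ragas 0.1.x API, where these metrics are exposed as `context_precision`, `context_recall`, and `answer_relevancy`; the sample data is made up):

```python
# Sketch of evaluating a toy sample with ragas' LLM-based metrics.
# ragas defaults to an OpenAI judge unless another LLM is configured.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall, answer_relevancy

samples = Dataset.from_dict({
    "question": ["Where is the Eiffel Tower located?"],
    "answer": ["The Eiffel Tower is in Paris, France."],
    "contexts": [["The Eiffel Tower is a wrought-iron tower in Paris, France."]],
    "ground_truth": ["Paris, France"],
})

result = evaluate(
    samples,
    metrics=[context_precision, context_recall, answer_relevancy],
)
print(result)  # per-metric scores, e.g. {'context_precision': ..., ...}
```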
-
Many thanks for making this feature available. It's a great help.
I wanted to let you know that your HuggingFace [CyberSecEval: Comprehensive Evaluation Framework for Cybersecurity Risks and Capab…
-
### **Is your feature request related to a problem? Please describe.**
PyRIT currently lacks built-in support for easily using and comparing multiple LLM providers. This makes it challenging for user…
-
Integrate MDEL with various evaluation frameworks (see the sketch after this list):
- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- [helm](https://github.com/stanford-crfm/helm)
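As a reference point, a minimal sketch of driving lm-evaluation-harness from Python (the checkpoint, task list, and batch size are placeholders; a helm integration would need its own adapter):

```python
# Sketch of running benchmarks through lm-evaluation-harness (v0.4-style API).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # HuggingFace backend
    model_args="pretrained=EleutherAI/pythia-1b",  # placeholder checkpoint
    tasks=["hellaswag", "arc_easy"],               # placeholder task list
    num_fewshot=0,
    batch_size=8,
)

print(results["results"])  # aggregated per-task metrics
```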
-
- [ ] [[2308.07201] ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate](https://arxiv.org/abs/2308.07201)
# [ChatEval: Towards Better LLM-based Evaluators through Multi-Agent De…
-
**What would you like to be added/modified:**
1. Build a collaborative code-intelligence agent alignment dataset for LLMs:
- The dataset should include behavioral trajectories, feedback, and i…
-
Hi there,
Thank you for bringing the elegant RAG Assessment framework to the community.
I am an AI engineer from Alibaba Cloud, and our team has been fine-tuning LLM-as-a-Judge models based on t…
-
This issue now tracks the implementation of various evaluation methods and workflows for LLMs.
Evaluations:
- [x] G-Eval (see the judge sketch after this list)
- [ ] PingPong
- [ ] InfiniteBench
- [ ] Ruler
- [ ] MMLU
- [ ] M…
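To make the first item concrete, a minimal G-Eval-style sketch: the judge LLM is given the evaluation criteria and step-by-step instructions, then returns a numeric score. The prompt wording, criteria, and `call_judge_llm` helper are illustrative assumptions, not a fixed implementation; the original G-Eval additionally weights scores by the judge's token probabilities.

```python
# Minimal G-Eval-style judge sketch. `call_judge_llm` is a placeholder for
# whatever chat-completion client is wired in.
GEVAL_PROMPT = """You will evaluate a model answer for COHERENCE on a 1-5 scale.

Evaluation steps:
1. Read the question and the answer.
2. Check whether the answer is logically organized and self-consistent.
3. Assign a score from 1 (incoherent) to 5 (fully coherent).

Question: {question}
Answer: {answer}

Return only the numeric score."""

def call_judge_llm(prompt: str) -> str:
    # Placeholder: swap in an actual LLM call here.
    raise NotImplementedError

def geval_coherence(question: str, answer: str) -> float:
    raw = call_judge_llm(GEVAL_PROMPT.format(question=question, answer=answer))
    return float(raw.strip())
```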