-
**What would you like to be added/modified**:
A benchmark suite for large language models deployed at the edge using KubeEdge-Ianvs:
1. Interface Design and Usage Guidelines Document;
2. Implem…
-
Hi, thanks for your great work! I have read your MovieChat+ paper and noticed that the Zero-shot QA Evaluation result of MovieChat on EgoSchema is 53.5, while the evaluation result in this CVPR paper (…
-
Hi,
We have run three Google Gemma models with Winogrande on MTL or LNL, and we got much lower accuracy than the Open LLM Leaderboard. The detailed data are below:
Model | Precision | Device | Trans…
-
**Code**
I tried to use `evaluate` with a `LangchainLLMWrapper`; however, for some metrics it still requires an OpenAI key. Here is the code:
```python
from ragrank import evaluate
from ragrank.evaluation import…
-
Hello 👋
First of all thank you for the great work and evaluation results!
I understand that in many cases you predicted outputs for each question based on the choice that minimizes the loss…
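The selection rule described above (pick the answer option whose continuation has the lowest loss, i.e. the highest likelihood under the model) can be sketched as follows. This is a hypothetical illustration: the per-choice losses are stubbed here, whereas in a real harness they would be the model's negative log-likelihoods for each candidate answer.

```python
# Minimal sketch of loss-minimizing multiple-choice prediction.
# `choice_losses` stands in for per-option average negative log-likelihoods
# that a real evaluation harness would compute from the model.

def pick_by_min_loss(choice_losses):
    """Return the index of the answer choice with the smallest loss."""
    return min(range(len(choice_losses)), key=lambda i: choice_losses[i])

# Example: three candidate answers with stubbed losses.
losses = [2.31, 1.08, 3.92]
prediction = pick_by_min_loss(losses)  # picks index 1, the lowest-loss choice
```

The same argmin-over-losses rule underlies log-likelihood-based multiple-choice scoring in common evaluation harnesses.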
-
- [ ] I have checked the [documentation](https://docs.ragas.io/) and related resources and couldn't resolve my bug.
**Describe the bug**
It's good that almost all metrics in ragas can be adapted to oth…
-
When errors happen in the evaluation view, there are a couple of problems:
* The error in the LLM app is shown in black and not in red as expected
* The issue is in the LLM app response and not in…
-
Below are the benchmark results on both THUDM/chatglm3-6b and openbmb/MiniCPM-2B-sft-bf16, which show that chatglm3-6b has better throughput than MiniCPM-2B. Considering MiniCPM-2B is a 2…
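For context on how throughput numbers like those above are typically obtained, here is a minimal timing sketch. All names are hypothetical (the excerpt does not show the actual benchmark harness), and the model call is stubbed so the example is self-contained; a real run would invoke the model's generate function instead.

```python
import time

def throughput_tokens_per_sec(generate, prompt, n_tokens):
    """Time one generation call and return tokens decoded per second."""
    start = time.perf_counter()
    generate(prompt, n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Stub that pretends each token takes ~1 ms to decode; a real benchmark
# would pass the model's generation callable here instead.
def fake_generate(prompt, n_tokens):
    time.sleep(0.001 * n_tokens)

tps = throughput_tokens_per_sec(fake_generate, "hello", 32)
```

In practice one would also average over several runs and discard a warm-up iteration, since the first call often includes one-time setup cost.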
-
- [ ] [AlpacaEval: Revolutionizing Model Evaluation with LLM-Based Automatic Tools](https://github.com/tatsu-lab/alpaca_eval?tab=readme-ov-file#making-a-new-evaluator)
# AlpacaEval: Revolutionizing M…
-
## User story
1. As a data engineer,
2. I want / need to implement and automate the calculation of key performance metrics
3. So that we can iteratively evaluate the performance of our LLM in answerin…