-
Hi folks,
I am trying to run HF-Leaderboard (v2) evals locally, and according to the blog https://huggingface.co/spaces/open-llm-leaderboard/blog, the scores are normalized and random prediction acc…
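For context, my reading of the normalization that blog describes (worth verifying against the post itself, since the excerpt is cut off) is roughly:

$$
\text{normalized} = \max\!\left(0,\ \frac{\text{raw accuracy} - \text{random baseline}}{1 - \text{random baseline}}\right) \times 100
$$

so a model that guesses randomly scores 0 and a perfect model scores 100.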
-
Not 100% sure, but right now it seems that if I update a prompt fragment (or several), the changes get propagated to other producers. This is slightly problematic for evals, because ideally we'd have a …
-
I did some smaller benchmarks (more like tests, really) and would like to continue with this endeavor to evaluate capabilities and weak spots.
It would also be interesting to test on codegen tasks vs …
-
Form for frosh to list the social events they have attended and to provide other comments for reference during the six-week period
-
## Description
How evaluation results are delivered is crucially important. This spike covers what a "model card" would look like for evaluating a model against our framework. The "model card" sh…
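Purely as a hypothetical illustration of the kind of data such a card might aggregate (every name below is my own guess, not the spike's spec), a minimal sketch:

```julia
# Illustrative only: a bare-bones container for eval results on a "model card".
# All field names here are assumptions, not part of the framework's spec.
struct ModelCard
    model_name::String
    framework_version::String
    scores::Dict{String,Float64}  # benchmark name => (normalized) score
    notes::String                 # caveats, known weak spots, run conditions
end
```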
-
This is a major feature release.
Spec: https://github.com/MadcowD/ell/blob/cd64ab9bb0d3a09195fef7a32ef77ac5d7e6c912/docs/ramblings/evalspec.md
Ramblings: https://github.com/MadcowD/ell/blob/cd64ab9…
-
Hey, your work is excellent! But I have a question about your sample_pipline.py: you construct a sample_pipline object but never call it, and the path parameter is missing. Are you missing this part of the…
-
Hi there,
maybe I am missing something completely, but I do not understand the meaning of `samples` and `evals` for the benchmarks.
If I run this code in the REPL
```
using BenchmarkTools
BenchmarkTo…
```
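In case it helps, my understanding of these two BenchmarkTools.jl parameters (worth double-checking against the package docs):

```julia
using BenchmarkTools

# `evals`   = how many times the expression is executed inside ONE timing
#             measurement; the reported time is that measurement divided by
#             `evals`. Raising it helps when a single call is faster than the
#             clock's resolution.
# `samples` = how many such timing measurements are collected in total; the
#             reported statistics (min/median/mean) are taken over the samples.
b = @benchmark sin(x) setup=(x = rand()) samples=200 evals=50
# => 200 samples, each one timing 50 back-to-back evaluations of sin(x)
```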
-
**Describe the bug**
Unable to generate a share URL. The Share button keeps showing infinite processing.
**To Reproduce**
Steps to reproduce the behavior, including example Promptfoo configurations if …