# carlini / yet-another-applied-llm-benchmark

A benchmark to evaluate language models on questions I've previously asked them to solve.

GNU General Public License v3.0 · 798 stars · 60 forks
## Issues
| # | Title | Author | Status | Comments |
|---|-------|--------|--------|----------|
| #18 | Noisy code extraction | 1wheel | closed 2 days ago | 2 |
| #17 | Enable support for Groq models | simveit | closed 1 day ago | 1 |
| #16 | Benchmark for some open source models | davideuler | opened 2 months ago | 3 |
| #15 | hparams key now matched LLM name | RyanSaxe | closed 3 months ago | 1 |
| #14 | How could I view results table for multiple models? | davideuler | closed 2 months ago | 6 |
| #13 | Check out pragmatics | phibenz | closed 3 months ago | 1 |
| #12 | Would dspy help the benchmark | davideuler | closed 3 months ago | 1 |
| #11 | Update RustRun() Docstring | grantmwilliams | closed 3 months ago | 1 |
| #10 | Would you want to make a leaderboard for this? | clefourrier | opened 3 months ago | 3 |
| #9 | [Test] Sentence generation | LeonEricsson | opened 4 months ago | 1 |
| #8 | DockerJob TTY Error | fostiropoulos | opened 4 months ago | 1 |
| #7 | Improve AWSV6 test evaluation | daulet | closed 3 months ago | 1 |
| #6 | WhatIsStarStar is too strict | dbieber | closed 4 months ago | 1 |
| #5 | add gemma model | ViswanathaReddyGajjala | opened 4 months ago | 1 |
| #4 | Add moonshot model | lychees | closed 4 months ago | 1 |
| #3 | tests: merge & merge conflict | alexisgauba | closed 4 months ago | 4 |
| #2 | minor fix to successfully run a single test case | ViswanathaReddyGajjala | closed 4 months ago | 3 |
| #1 | Update evaluator.py | Evanc123 | closed 4 months ago | 0 |