bigcode-project / bigcode-evaluation-harness
A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0 · 781 stars · 208 forks
Issues (sorted by: Newest)
#277 · Speedup execute.py: Reuse same manager and dict in · michaelfeil · opened 2 days ago · 0 comments
#276 · trust_remote_code is not being passed to dataset · login256 · opened 1 week ago · 1 comment
#275 · ValueError: Infilling not yet supported for:/Meta-Llama-3.1-8B · kbmlcoding · opened 1 week ago · 0 comments
#274 · Update `fsspec` · younesbelkada · opened 1 week ago · 1 comment
#273 · MultiPL-E now supports Dart, OCaml, Elixir, Haskell, Clojure · arjunguha · closed 1 week ago · 2 comments
#272 · Evaluation result of bigcode/starcoder2-3b on gsm8k_pal does not match the paper · nongfang55 · opened 3 weeks ago · 0 comments
#271 · Evaluating a Model with a Local Dataset in an Offline Environment · ankush13r · opened 3 weeks ago · 8 comments
#270 · Any pypi package for this tool? · zhimin-z · opened 3 weeks ago · 0 comments
#269 · fix typo bug in mbppplus · zkcpku · opened 4 weeks ago · 0 comments
#268 · Unable to execute the `MultiPL-E` task for `python` language · manthan0227 · opened 4 weeks ago · 0 comments
#267 · ImportError: cannot import name 'SyncManager' from partially initialized module 'multiprocessing.managers' · xinghuang2050 · opened 1 month ago · 1 comment
#266 · What is `fine-tuning` in task submission? · zhimin-z · opened 1 month ago · 1 comment
#265 · Downgrade pyext dependency · shehrozek-cerebras · closed 1 month ago · 1 comment
#264 · Error when testing codegeex4-all-9b model · Gumingbro · opened 1 month ago · 1 comment
#263 · Basecodes · Abhineetsoccer · opened 1 month ago · 0 comments
#262 · HumanEval and MBPP results of deepseek-6.7b-coder-instruct are lower than official report of DeepSeek team · jessyford · opened 1 month ago · 2 comments
#261 · HumanEval evaluation results mismatch for Codegemma-2b · berserank · opened 2 months ago · 1 comment
#260 · Add a new benchmark ENAMEL for evaluating the efficiency of LLM-generated code · q-rz · opened 2 months ago · 0 comments
#259 · `multiple-cs`, `multiple-go`, and `multiple-d` are broken (`FileNotFound`) · alat-rights · closed 2 months ago · 2 comments
#258 · [Possibly system specific] Wild (12% vs 20%) run-to-run swings in `multiple-cpp` reported scores · alat-rights · opened 2 months ago · 1 comment
#257 · Fix Max New Tokens in HF's Generation Config · mostafaelhoushi · opened 2 months ago · 2 comments
#256 · Need some context for certain args for Instruct Human Eval · teknium1 · opened 2 months ago · 2 comments
#255 · Update Dockerfile-multiple · arjunguha · closed 2 months ago · 0 comments
#254 · Update MultiPL-E to v3 prompts · arjunguha · closed 2 months ago · 0 comments
#253 · [draft] save prompts and tests passed/failed · kbmlcoding · closed 2 months ago · 0 comments
#252 · The evaluation results are inconsistent across different GPUs · DonteFlynn · opened 2 months ago · 0 comments
#251 · Using the humanevalpack to test the ChatGLM3 model results in an abnormal score. · burger-pb · opened 3 months ago · 0 comments
#250 · Hi · burger-pb · closed 3 months ago · 0 comments
#249 · Fix unnecessary repeated overwrite · nielstron · opened 3 months ago · 0 comments
#248 · Updating SparseML loading to SparseAutoModel · abhinavnmagic · closed 3 months ago · 0 comments
#247 · Fix: Leaderboard submission Documentation · anil-gurbuz · closed 3 months ago · 0 comments
#246 · MBPP Llama3-8B-Instruct pass@1 score lower than expected · YangZhou08 · opened 3 months ago · 1 comment
#245 · Using custom prompts and postprocessing · anil-gurbuz · opened 3 months ago · 1 comment
#244 · Adding support for transformers>=4.40.2 to avoid crash with mbpp · meher-m · closed 3 months ago · 2 comments
#243 · Value error regarding "--max_length_generation" · aladinggit · closed 3 months ago · 1 comment
#242 · Following submission documentation fails to save generated model outputs for our (Jdoodle) private HF model · anil-gurbuz · closed 3 months ago · 3 comments
#241 · fix: rust timeout exception · seobeomjin · closed 3 months ago · 1 comment
#240 · Some questions about APPS · virt9 · closed 3 months ago · 1 comment
#239 · Changes · meher-m · closed 4 months ago · 0 comments
#238 · Add a new dataset Mercury · Elfsong · closed 4 months ago · 2 comments
#237 · How can I pass num_beams? · Sunt-ing · closed 3 months ago · 2 comments
#236 · Fix MBPP bug with transformers 4.38+ · edgan8 · closed 3 months ago · 6 comments
#235 · Run 70b evaluation. · icoderzqliu · closed 4 months ago · 2 comments
#234 · API-based evaluation support (humanevalpack_openai.py is too old) · s-natsubori · opened 4 months ago · 0 comments
#233 · The result of llama2-13b-chat (pass@1 18) is worse than the paper (pass@1 37) · moyi-qwq · closed 4 months ago · 2 comments
#232 · Add package name and enforce python version in `setup.py` · shehrozek-cerebras · closed 5 months ago · 4 comments
#231 · Make `bigcode_eval` pip installable · shehrozek-cerebras · closed 3 months ago · 1 comment
#230 · If I want to add my own prompts before each question, how should I modify the code? · ALLISWELL8 · opened 5 months ago · 1 comment
#229 · Add StudentEval from LLM4Code 2024 · arjunguha · closed 5 months ago · 0 comments
#228 · The Llama3-8b pass@1 results are worse than reported · shuaiwang2022 · opened 5 months ago · 6 comments