Open RylanSchaeffer opened 2 months ago
If making this change is relatively simple, could you let me know how to make it, and I can open a PR?
Hi, I encounter the same need.

I found that the code can be easily modified to log the per-sample scores. Taking the humaneval task as an example, the second return value of compute_code_eval (called in process_result: link) is a list of per-sample logs, including whether each output passed. The original implementation just uses a `_` to discard these logs. If you want to write them to a file, you can save the value with `json.dump`.
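A minimal sketch of the `json.dump` approach, using a mocked `details` structure standing in for the second return value of compute_code_eval (the exact field names here are assumptions for illustration, not the harness's guaranteed schema):

```python
import json

# Hypothetical per-sample logs, shaped like the per-task dict of
# (completion_id, result) tuples that the code_eval metric builds.
# Field names ("task_id", "passed", "result") are assumptions.
details = {
    0: [(0, {"task_id": 0, "completion_id": 0, "passed": True, "result": "passed"})],
    1: [(0, {"task_id": 1, "completion_id": 0, "passed": False, "result": "failed: AssertionError"})],
}

# Flatten the (completion_id, log) tuples into a plain list of records
# so the output is valid, easily re-loadable JSON.
records = [log for logs in details.values() for _, log in logs]

with open("details.json", "w") as f:
    json.dump(records, f, indent=2)
```

Reloading `details.json` then gives you one record per generated sample, with its pass/fail status.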
Yes, as @shuhao02 explained and as I mentioned in this issue, you can dump the results returned by the code_eval metric: https://github.com/bigcode-project/bigcode-evaluation-harness/issues/211#issuecomment-2027100342

```python
results, details = compute_code_eval(..)
details.to_json("details.json")
```
Feel free to open a PR adding such a flag to the harness, but note that not all tasks use this metric.
Hi! I've been using EleutherAI's LM Evaluation Harness, and I'd like to also run some code tasks using your Big Code Evaluation Harness. We need the scores for each sample in each benchmark, and the LM Evaluation Harness has a helpful flag, `log_samples`, that enables logging of per-sample scores. As best as I can tell (and please correct me if I'm wrong), Big Code's Evaluation Harness doesn't have a similar flag. If my understanding is correct, could this please be added?
Thank you!