bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.

Please add flag to log score for each sample (akin to Eleuther's LM Evaluation Harness) #215

Open RylanSchaeffer opened 2 months ago

RylanSchaeffer commented 2 months ago

Hi! I've been using EleutherAI's LM Evaluation Harness and I'd like to also run some code tasks using your BigCode Evaluation Harness. We need the score for each sample in each benchmark, and the LM Evaluation Harness has a helpful log_samples flag that enables logging of per-sample scores.

As best I can tell (and please correct me if I'm wrong), the BigCode Evaluation Harness doesn't have a similar flag. If my understanding is correct, could one please be added?

Thank you!

RylanSchaeffer commented 2 months ago

If making this change is relatively simple, could you let me know how to make it? I'd be happy to open a PR.

shuhao02 commented 2 months ago

Hi, I have the same need.

I found that the code can easily be modified to log per-sample scores.

Taking the humaneval task as an example: the second return value of compute_code_eval (called in process_results; see link) contains the per-sample logs, including whether each output passed, as a list.

In the original implementation, the returned logs are simply discarded by assigning them to _. If you want to write the logs to a file, you can save them with json.dump.
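
For concreteness, a minimal sketch of that change inside the humaneval task's process_results might look like the following; the import path, exact keyword arguments, and the output path "details.json" are assumptions and may differ slightly from the actual code in your checkout:

import json

from bigcode_eval.tasks.custom_metrics.code_eval import compute_code_eval

def process_results(self, generations, references):
    # compute_code_eval returns (pass@k scores, per-sample execution logs);
    # the stock implementation discards the second value with `_`
    results, details = compute_code_eval(
        references=references,
        predictions=generations,
    )
    # `details` maps each problem index to a list of (completion_id, result dict)
    # entries, where the result dict records whether that completion passed
    with open("details.json", "w") as f:
        json.dump(details, f, indent=2)
    return results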

loubnabnl commented 2 months ago

Yes, as @shuhao02 explained and as I mentioned in this issue, you can dump the results returned by the code_eval metric: https://github.com/bigcode-project/bigcode-evaluation-harness/issues/211#issuecomment-2027100342

results, details = compute_code_eval(...)
with open("details.json", "w") as f:
    json.dump(details, f)

Feel free to open a PR adding such a flag to the harness, but note that not all tasks use this metric.
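
If someone does pick this up, a rough sketch of the wiring is below. The flag name --log_sample_details, the helper names, and the output path are all hypothetical rather than existing harness options, and, per the note above, only tasks that use the code_eval metric could honor the flag:

import argparse
import json

def add_sample_logging_flag(parser: argparse.ArgumentParser) -> None:
    # hypothetical flag; would need to be added to the harness's main argument parser
    parser.add_argument(
        "--log_sample_details",
        action="store_true",
        help="Dump per-sample execution logs for tasks that use the code_eval metric.",
    )

def maybe_dump_details(details, path, enabled):
    # `details` is the second return value of compute_code_eval: a mapping from
    # problem index to a list of (completion_id, result dict) entries
    if enabled:
        with open(path, "w") as f:
            json.dump(details, f, indent=2)

A task's process_results (or the evaluator calling it) would then pass the parsed flag value and the details object through to maybe_dump_details.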