bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Add batch evaluation support when batch_size > 1 #36

Open infinitylogesh opened 1 year ago

infinitylogesh commented 1 year ago

Fixes #23

infinitylogesh commented 1 year ago

Added num_return_sequences as an argument, since batch_size doubling as num_return_sequences was confusing. Now num_return_sequences holds the number of generations per input and batch_size the number of inputs in the batch. Hope this change is fine? I have updated the docs and examples with the new argument.
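For illustration, here is a minimal sketch (my own example with gpt2, not the PR's code) of how the two arguments combine in a single generate call: batch_size distinct prompts go in, and each prompt yields num_return_sequences sampled completions.

```python
# Minimal sketch (not the PR's code): batch_size controls how many distinct
# prompts go through one forward pass, num_return_sequences how many sampled
# completions each of those prompts receives.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
tok.padding_side = "left"  # left-pad for decoder-only generation
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompts = ["def add(a, b):", "def is_even(n):"]   # batch_size = 2
inputs = tok(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model.generate(
        **inputs,
        do_sample=True,
        num_return_sequences=3,                    # 3 generations per prompt
        max_new_tokens=16,
        pad_token_id=tok.eos_token_id,
    )

print(out.shape[0])  # 2 prompts * 3 return sequences = 6 generated sequences
```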

infinitylogesh commented 1 year ago

@loubnabnl @Muennighoff Please review and let me know your comments. Thanks. (I don't seem to have access to request a review.)

infinitylogesh commented 1 year ago

Thank you so much for the detailed review and for catching this issue. I will look into it further and update!

infinitylogesh commented 1 year ago

My updates after further analysis. I found the following to be influencing the variation in the scores (apart from the task-id repetition issue):

  1. Device-specific seed: By default the device_specific parameter in set_seed is set to True. In the case where num_return_sequences = n_samples, changing the batch size can place a given task on a different GPU at runtime, so the task sees a different seed and the results vary. I have currently set the device_specific flag to False when num_return_sequences = n_samples (see the sketch right after this list).
  2. Transformers repo: generations from the model vary with batch size even when the inputs passed to the model are ensured to be the same. I have tried to replicate the variations in this colab for SantaCoder and CodeGen, and there are existing issues in the transformers repo pointing to this behaviour (issue 1, issue 2, issue 3, issue 4). Digging a bit deeper, I suspect the reasons for these variations are (also shown in the colab):
    • logits returned by transformers vary for the same input as the batch size varies
    • torch.multinomial, used to sample the next token, can return a different next token for the same input when the batch size changes, since the same input may then sit at a different index in the batch (see the sketch at the end of this comment)
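For the first point, here is a rough sketch of the seeding change, assuming accelerate's set_seed (paraphrased, not the exact diff; the helper name is mine):

```python
# Rough sketch of the change described in point 1 (paraphrased, not the exact
# diff): when num_return_sequences == n_samples, a task gets all of its samples
# on whichever GPU it lands on, so a device-specific seed would make the result
# depend on device placement; the flag is therefore disabled in that case.
from accelerate.utils import set_seed

def seed_for_generation(seed: int, num_return_sequences: int, n_samples: int) -> None:
    # assumption: this mirrors the condition described above, not the PR's exact code
    device_specific = num_return_sequences != n_samples
    set_seed(seed, device_specific=device_specific)
```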

I am afraid our scores will only be stable across batch sizes if this variation in the transformers repo is handled. Please let me know if there are any workarounds or suggestions.
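To make the second bullet concrete, a small standalone illustration (my own sketch, not the colab) of how torch.multinomial's draw for an identical probability row can change once that row sits at a different index in the batch:

```python
# With a fixed seed, torch.multinomial consumes random numbers in batch order,
# so the sample drawn for an identical probability row can change when that row
# appears at a different index (i.e. when the batch size / composition changes).
import torch

row = torch.tensor([0.5, 0.3, 0.2])

torch.manual_seed(0)
alone = torch.multinomial(row.unsqueeze(0), num_samples=1)              # row at index 0

torch.manual_seed(0)
other = torch.tensor([0.1, 0.1, 0.8])
batched = torch.multinomial(torch.stack([other, row]), num_samples=1)   # same row, now at index 1

print(alone[0].item(), batched[1].item())  # the two draws need not match
```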

infinitylogesh commented 1 year ago

Update! I replicated this behaviour of generations varying with batch size using an external repo:

I used the batch generation script from the incoder repo (as suggested by Daniel Fried on Slack) and was able to replicate the behaviour (screenshot below; full colab here). For the same set of inputs, the generations vary based on the batch size.

So I believe this is a general behaviour and, based on my analysis in the previous comments, probably expected.

[screenshot: generations for the same inputs differing across batch sizes]
Muennighoff commented 1 year ago


That's very odd. Does it also happen for non-code models using the built-in transformers generate function with a batch? E.g. generating with https://huggingface.co/gpt2

infinitylogesh commented 1 year ago

Yes, this happens with the gpt2 model too. Please check the colab; it has an example with GPT2. This has also been discussed in other issues (issue 1, issue 2).
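For anyone who wants a quick repro without opening the colab, something along these lines shows it (my own sketch, not the notebook): sample the same prompt with the same seed, once alone and once inside a batch of two.

```python
# Quick repro sketch (not the linked colab): the same prompt, sampled with the
# same seed, can yield a different continuation when generated alone vs. inside
# a larger batch.
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
tok.padding_side = "left"  # left-pad so generation starts right after each prompt
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "def fibonacci(n):"

def first_completion(prompts):
    set_seed(0)
    inputs = tok(prompts, return_tensors="pt", padding=True)
    out = model.generate(**inputs, do_sample=True, max_new_tokens=20,
                         pad_token_id=tok.eos_token_id)
    return tok.decode(out[0], skip_special_tokens=True)

print(first_completion([prompt]))                     # batch size 1
print(first_completion([prompt, "import numpy as"]))  # same prompt at index 0, batch size 2
# The two printed continuations for the first prompt can differ.
```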

huybery commented 1 year ago

Any new progress? Everyone needs it. 😁