YaserAlOsh / JIT-SDP-CodePTMs

Official code for our paper on fine-tuning pre-trained code language models for the task of Just-in-Time software defect prediction

Issue in reproducing CodeReviewer results #1

Open · monilouise opened this issue 1 month ago

monilouise commented 1 month ago

Hi,

I've found your paper "Parameter Efficient Fine-Tuning of Pre-trained Code Models for Just-in-Time Defect Prediction" and I'm trying to reproduce your results with CodeReviewer. However, I ran into the following issue: it seems you use `</s>` (the default EOS token in T5, according to the documentation) as the "file separator" in the change information, and the code throws the following error with the current transformers version:

ValueError: All examples must have the same number of tokens.

Would you happen to have any workaround?

Thanks.
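For context, here is a minimal sketch of how `</s>` file separators can trigger this error. It assumes the classification head is transformers' T5ForSequenceClassification (which the traceback later in this thread points to) and uses t5-small purely as a stand-in for the CodeReviewer checkpoint:

```python
# Minimal sketch (not the repo's code): examples whose files are joined
# with </s> end up with different numbers of EOS tokens, which the
# EOS-pooling step in T5ForSequenceClassification rejects.
import torch
from transformers import AutoTokenizer, T5ForSequenceClassification

tok = AutoTokenizer.from_pretrained("t5-small")
model = T5ForSequenceClassification.from_pretrained("t5-small", num_labels=2)

# A one-file change vs. a two-file change whose files are joined with </s>.
texts = ["def foo(): pass", "def foo(): pass </s> def bar(): pass"]
batch = tok(texts, padding=True, return_tensors="pt")

# The first row contains one EOS token, the second two, so
# forward() raises the ValueError quoted above.
model(**batch, labels=torch.tensor([0, 1]))
```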

YaserAlOsh commented 1 month ago

Hi, thanks for reaching out. This error usually means there is an issue with the padding; it's probably in the data collator part. If you can, please do some quick research on what changed in the current transformers version that could have caused this error. It's probably unrelated to the EOS token used in file separation. I will also have a look in my free time.

Best, Yaser
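
One possible workaround, sketched here as an assumption rather than the repository's actual fix: join files with a non-EOS separator token, for example a T5 sentinel such as `<extra_id_0>`, so that every example keeps exactly one EOS token:

```python
# A possible workaround (an assumption, not the repo's fix): use a
# non-EOS separator so eos_mask.sum(1) is constant across the batch.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")  # stand-in for CodeReviewer

files = ["def foo(): pass", "def bar(): pass"]
change = " <extra_id_0> ".join(files)  # instead of " </s> ".join(files)
batch = tok([change], padding=True, return_tensors="pt")

# Only the final, tokenizer-appended </s> remains in each example.
assert (batch["input_ids"] == tok.eos_token_id).sum() == 1
```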



monilouise commented 1 month ago

Hi @YaserAlOsh ,

Adding more detail to the issue, here is the full traceback:

```
/usr/local/lib/python3.10/dist-packages/transformers/models/t5/modeling_t5.py in forward(self, input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, inputs_embeds, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   2071
   2072 if len(torch.unique_consecutive(eos_mask.sum(1))) > 1:
-> 2073     raise ValueError("All examples must have the same number of tokens.")
   2074 batch_size, _, hidden_size = sequence_output.shape
   2075 sentence_representation = sequence_output[eos_mask, :].view(batch_size, -1, hidden_size)[:, -1, :]
```

The Transformers version is 4.45.2.

I manually inspected some lists of input IDs and, in fact, the EOS token appears more than once, presumably because of the file separation. Strangely, even after downgrading to the library versions below, the same error occurs.

transformers==4.35.2, tokenizers==0.15.0, datasets==2.14.5

```
/usr/local/lib/python3.10/dist-packages/transformers/models/t5/modeling_t5.py in forward(self, input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, inputs_embeds, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   2077
   2078 if len(torch.unique_consecutive(eos_mask.sum(1))) > 1:
-> 2079     raise ValueError("All examples must have the same number of tokens.")
   2080 batch_size, _, hidden_size = sequence_output.shape
   2081 sentence_representation = sequence_output[eos_mask, :].view(batch_size, -1, hidden_size)[:, -1, :]

ValueError: All examples must have the same number of tokens.
```

Finally, I'm running these tests on Google Colab (A100).
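
To confirm the multiple-EOS diagnosis above, here is an illustrative check (again with t5-small standing in for CodeReviewer) that counts EOS tokens per padded example; unequal counts across a batch are exactly what the pooling step in T5ForSequenceClassification rejects:

```python
# Illustrative check (not from the repo): count EOS tokens per example.
import torch
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")
texts = ["def foo(): pass", "def foo(): pass </s> def bar(): pass"]
batch = tok(texts, padding=True, return_tensors="pt")

eos_counts = (batch["input_ids"] == tok.eos_token_id).sum(dim=1)
print(eos_counts)                            # e.g. tensor([1, 2])
print(torch.unique_consecutive(eos_counts))  # length > 1 -> ValueError
```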