the previous implementation was slightly incorrect, I think, due to padding on input_ids.
You would have to e.g. sum across the attention mask to determine each sequence's true length (and hence the shortest one). Instead it's cleaner to just set it to None imo.
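A minimal sketch of the issue (PyTorch tensors and pad id 0 are illustrative assumptions, not from this PR): with right-padding, the width of input_ids overstates the shorter sequences, while summing the attention mask per row recovers the true lengths.

```python
import torch

# Hypothetical padded batch; 0 is the (assumed) pad token id.
input_ids = torch.tensor([
    [5, 6, 7, 0, 0],   # real length 3
    [5, 6, 7, 8, 9],   # real length 5
])
attention_mask = (input_ids != 0).long()

# Incorrect: the padded width treats every sequence as length 5.
padded_len = input_ids.shape[1]

# Correct: sum the attention mask per row to get true lengths.
true_lens = attention_mask.sum(dim=1)
shortest = true_lens.min().item()  # 3, not 5
```

Passing None instead sidesteps this bookkeeping entirely, which is why it reads as the cleaner option.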
You may not want to merge the mlsum changes though