dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Apache License 2.0

Finetune is failing ValueError: operands could not be broadcast together with shapes (384,576) (3,) #61

Open amitkayal opened 2 years ago

amitkayal commented 2 years ago

Hi, I am trying to retrain the network further on the same VQA dataset, and it fails with the following error:

```
     49             data = [self.dataset[idx] for idx in possibly_batched_index]
     50         else:
     51             data = self.dataset[possibly_batched_index]

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
     47     def fetch(self, possibly_batched_index):
     48         if self.auto_collation:
---> 49             data = [self.dataset[idx] for idx in possibly_batched_index]
     50         else:
     51             data = self.dataset[possibly_batched_index]

in __getitem__(self, idx)
     22         text = questions['question']
     23
---> 24         encoding = self.processor(image, text, padding="max_length", truncation=True, return_tensors="pt")
     25         # remove batch dimension
     26         for k,v in encoding.items():

/usr/local/lib/python3.7/dist-packages/transformers/models/vilt/processing_vilt.py in __call__(self, images, text, add_special_tokens, padding, truncation, max_length, stride, pad_to_multiple_of, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, return_tensors, **kwargs)
     89         )
     90         # add pixel_values + pixel_mask
---> 91         encoding_feature_extractor = self.feature_extractor(images, return_tensors=return_tensors)
     92         encoding.update(encoding_feature_extractor)
     93

/usr/local/lib/python3.7/dist-packages/transformers/models/vilt/feature_extraction_vilt.py in __call__(self, images, pad_and_return_pixel_mask, return_tensors, **kwargs)
    264             ]
    265         if self.do_normalize:
--> 266             images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]
    267
    268         if pad_and_return_pixel_mask:

/usr/local/lib/python3.7/dist-packages/transformers/models/vilt/feature_extraction_vilt.py in <listcomp>(.0)
    264             ]
    265         if self.do_normalize:
--> 266             images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]
    267
    268         if pad_and_return_pixel_mask:

/usr/local/lib/python3.7/dist-packages/transformers/image_utils.py in normalize(self, image, mean, std)
    218             return (image - mean[:, None, None]) / std[:, None, None]
    219         else:
--> 220             return (image - mean) / std
    221
    222     def resize(self, image, size, resample=PIL.Image.BILINEAR, default_to_square=True, max_size=None):

ValueError: operands could not be broadcast together with shapes (384,576) (3,)
```
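The shapes in the message, `(384,576)` against `(3,)`, suggest that a 2-D array (possibly a grayscale image with no channel axis) is being normalized with a 3-element per-channel mean. This is only a guess from the traceback, not a confirmed diagnosis. The NumPy sketch below reproduces that kind of broadcast failure and shows one possible workaround: giving the array three channels before normalizing. The mean/std values and the helper names `broadcast_error` and `to_three_channels` are hypothetical placeholders, not part of `transformers`.

```python
import numpy as np

# Placeholder inputs (assumptions): a 2-D array standing in for a grayscale
# image, and 3-element per-channel statistics like those in the traceback.
gray = np.random.rand(384, 576).astype(np.float32)
mean = np.array([0.5, 0.5, 0.5], dtype=np.float32)
std = np.array([0.5, 0.5, 0.5], dtype=np.float32)


def broadcast_error(image, mean, std):
    """Return the ValueError message from (image - mean) / std, or None."""
    try:
        _ = (image - mean) / std
    except ValueError as exc:
        return str(exc)
    return None


def to_three_channels(image):
    """Hypothetical helper: repeat a 2-D array into a (3, H, W) array."""
    if image.ndim == 2:
        return np.stack([image] * 3, axis=0)
    return image


def normalize_channels_first(image, mean, std):
    """Normalize a (C, H, W) array with per-channel mean and std."""
    return (image - mean[:, None, None]) / std[:, None, None]


# The 2-D array cannot be broadcast against a length-3 mean.
message = broadcast_error(gray, mean, std)

# After adding a channel axis, per-channel normalization goes through.
normalized = normalize_channels_first(to_three_channels(gray), mean, std)
```

If the images are loaded with Pillow, calling `image.convert("RGB")` in `__getitem__` before passing the image to the processor would have a similar effect, assuming some images in the dataset are in fact not RGB.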

YorkNishi999 commented 2 years ago

@amitkayal @dandelin Hello. I got the same error. Have you solved this?

Ellyuca commented 1 year ago

Any solution to this issue? Thanks.