Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Fine-tuning fails with ValueError: operands could not be broadcast together with shapes (384,576) (3,) #61
Open
amitkayal opened 2 years ago
Hi, I am trying to train the network further on the same VQA dataset, and it fails with the following error:
```
            data = [self.dataset[idx] for idx in possibly_batched_index]
     50         else:
     51             data = self.dataset[possibly_batched_index]

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
     47     def fetch(self, possibly_batched_index):
     48         if self.auto_collation:
---> 49             data = [self.dataset[idx] for idx in possibly_batched_index]
     50         else:
     51             data = self.dataset[possibly_batched_index]

/usr/local/lib/python3.7/dist-packages/transformers/models/vilt/processing_vilt.py in __call__(self, images, text, add_special_tokens, padding, truncation, max_length, stride, pad_to_multiple_of, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, return_tensors, **kwargs)
     89         )
     90         # add pixel_values + pixel_mask
---> 91         encoding_feature_extractor = self.feature_extractor(images, return_tensors=return_tensors)
     92         encoding.update(encoding_feature_extractor)
     93

/usr/local/lib/python3.7/dist-packages/transformers/models/vilt/feature_extraction_vilt.py in __call__(self, images, pad_and_return_pixel_mask, return_tensors, **kwargs)
    264             ]
    265         if self.do_normalize:
--> 266             images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]
    267
    268         if pad_and_return_pixel_mask:

/usr/local/lib/python3.7/dist-packages/transformers/models/vilt/feature_extraction_vilt.py in <listcomp>(.0)
    264             ]
    265         if self.do_normalize:
--> 266             images = [self.normalize(image=image, mean=self.image_mean, std=self.image_std) for image in images]
    267
    268         if pad_and_return_pixel_mask:

/usr/local/lib/python3.7/dist-packages/transformers/image_utils.py in normalize(self, image, mean, std)
    218             return (image - mean[:, None, None]) / std[:, None, None]
    219         else:
--> 220             return (image - mean) / std
    221
    222     def resize(self, image, size, resample=PIL.Image.BILINEAR, default_to_square=True, max_size=None):

ValueError: operands could not be broadcast together with shapes (384,576) (3,)
```
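
The shapes in the message suggest that the array reaching `normalize` is 2-D, (384, 576), with no channel axis, while the mean has 3 entries. One possible cause, which the traceback alone does not confirm, is a single-channel (e.g. grayscale) image in the dataset. Below is a minimal NumPy sketch that reproduces the same broadcast failure under that assumption. The helper name `normalize_channels_last` and the 0.5 statistics are placeholders for illustration, not the library's actual code or defaults.

```python
import numpy as np

# Per-channel statistics with shape (3,), matching the second shape in the error.
# The value 0.5 is a placeholder, not necessarily what the ViLT feature extractor uses.
mean = np.array([0.5, 0.5, 0.5])
std = np.array([0.5, 0.5, 0.5])

# A 2-D array with the first shape reported in the error (no channel axis).
gray = np.zeros((384, 576), dtype=np.float32)


def normalize_channels_last(image, mean, std):
    """Hypothetical helper using the same arithmetic as the failing line."""
    return (image - mean) / std


# Broadcasting aligns trailing axes: 576 vs 3 cannot be matched, so this raises.
try:
    normalize_channels_last(gray, mean, std)
    failed = False
    message = ""
except ValueError as err:
    failed = True
    message = str(err)

# Repeating the single channel three times gives shape (384, 576, 3),
# whose trailing axis matches the (3,) statistics.
rgb_like = np.stack([gray] * 3, axis=-1)
normalized = normalize_channels_last(rgb_like, mean, std)

print(failed, message)
print(normalized.shape)
```

If the images are loaded with Pillow, calling `image.convert("RGB")` before passing them to the processor is a common way to guarantee three channels. Whether that applies here depends on what the dataset's `__getitem__` actually returns.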