Finetune Git for a VQA-like task

Hi, I am trying to fine-tune the GIT model for a VQA-like task. The task takes one text input and one image, and outputs one text. During training the loss decreases, but the results on the dev dataset get worse. Maybe I am making some mistakes in the data processing. How should I organize the data?

processor_image = self.processor(images=self.image_list[index], return_tensors="pt")

input_processor = self.processor(text=self.input_sentence_list[index], max_length=self.max_len, truncation=True, padding="max_length", add_special_tokens=False, return_tensors="pt")

label_process = self.processor(text=self.aspect_label[index], max_length=self.max_len, truncation=True, padding="max_length", return_tensors="pt")

return {"input_ids": input_processor.input_ids.squeeze(), "attention_mask": input_processor.attention_mask.squeeze(), "pixel_value": processor_image.pixel_values.squeeze(), "label": label_process.input_ids.squeeze()}

What should I do?
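
One possible way to organize the data is sketched below. It is not a confirmed fix. It assumes GIT is trained as a causal language model on a single sequence `[CLS] question answer [SEP]`, with the loss restricted to the answer tokens, and that positions labeled `-100` are ignored by the loss. In the snippet above, the labels are the padded answer ids on their own, so they do not line up with `input_ids`, and the padding positions appear to contribute to the loss. The helper name `build_vqa_example` and its parameters are hypothetical. It works on plain lists of token ids so that it runs without downloading a model.

```python
from typing import Dict, List

# Label value that PyTorch's CrossEntropyLoss skips by default.
IGNORE_INDEX = -100


def build_vqa_example(
    question_ids: List[int],
    answer_ids: List[int],
    cls_id: int,
    sep_id: int,
    pad_id: int,
    max_len: int,
) -> Dict[str, List[int]]:
    """Build one training example (hypothetical helper, not a library API).

    Assumed layout: [CLS] question answer [SEP], padded to max_len.
    Only the answer tokens and the closing [SEP] keep real labels.
    """
    prompt = [cls_id] + question_ids      # conditioning part, no loss
    target = answer_ids + [sep_id]        # part the model should learn to emit

    # Truncate the joined sequence, then pad all three fields to max_len.
    input_ids = (prompt + target)[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt) + target)[:max_len]
    attention_mask = [1] * len(input_ids)

    n_pad = max_len - len(input_ids)
    input_ids = input_ids + [pad_id] * n_pad
    attention_mask = attention_mask + [0] * n_pad
    labels = labels + [IGNORE_INDEX] * n_pad  # padding never counts in the loss

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


example = build_vqa_example(
    question_ids=[5, 6], answer_ids=[7],
    cls_id=101, sep_id=102, pad_id=0, max_len=6,
)
print(example)
```

In a dataset's `__getitem__`, `question_ids` and `answer_ids` could come from `self.processor.tokenizer(text, add_special_tokens=False).input_ids`, and the three lists could be wrapped in `torch.tensor(...)`. If the batch is passed straight to the model as keyword arguments, the keys would need to be `pixel_values` and `labels` (plural) rather than `pixel_value` and `label`. On the dev set, the input to `generate` would then be only `[CLS]` plus the question tokens, so that the model continues with the answer.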

NielsRogge / Transformers-Tutorials

Finetune Git for a VQA-like task #262