Hi,
Thanks for your interest in GIT :)
For visual question answering, the question and the ground-truth answer are concatenated as a new special caption during fine-tuning, but the LM loss is only applied on the answer and the `[EOS]` tokens.
This is a typical trick used when fine-tuning Transformer decoders; it can be achieved by setting the labels to -100 for all tokens on which you don't want to incur a loss (as -100 is the `ignore_index` of PyTorch's cross-entropy loss).
An example of how this can be done can be seen here in the Donut repo (Donut is a model very similar to GIT that can also do VQA on images). As can be seen, the `labels` are a copy of the `input_ids`, but then we make sure the model doesn't need to predict the prompt (like the question in the case of VQA) by replacing the prompt tokens with the `ignore_id`, which is set to -100. Note that each training example has a different number of prompt tokens (as the question of each training example can have a different length), hence the Donut authors check where the `prompt_end_token_id` occurs in the sequence to know where the question ends and the answer starts.
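Concretely, a minimal sketch of this masking for GIT could look like the following (illustrative, not the exact Donut code; it assumes GIT's BERT-style tokenizer from `microsoft/git-base`, where `[CLS]` plays the role of BOS and `[SEP]` the role of EOS, and the `question`/`answer` strings are placeholders):

```python
import torch
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/git-base")

question = "what does the sign say?"  # placeholder example
answer = "stop"

# Tokenize question and answer separately, then concatenate them into one
# sequence, since GIT treats question+answer as a single special caption
question_ids = processor.tokenizer(question, add_special_tokens=False).input_ids
answer_ids = processor.tokenizer(answer, add_special_tokens=False).input_ids
input_ids = (
    [processor.tokenizer.cls_token_id]  # BOS
    + question_ids
    + answer_ids
    + [processor.tokenizer.sep_token_id]  # EOS
)

# The labels start out as a copy of the input_ids...
labels = list(input_ids)
# ...but the prompt ([CLS] + question) is replaced by -100, the ignore_index
# of PyTorch's cross-entropy loss, so that only the answer and the EOS token
# contribute to the loss
prompt_length = 1 + len(question_ids)
labels[:prompt_length] = [-100] * prompt_length

input_ids = torch.tensor(input_ids)
labels = torch.tensor(labels)
```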
Thank you for your answer! Wow, I understand better now. I will try this trick, thank you so much again!
Update: I just added this modification to my preprocessing and it seems to work fine now (I tried it on a smaller sample of the TextVQA dataset)!
Hi!
I know there is already a fine-tuned version of GIT on TextVQA on Hugging Face, but I am personally trying to fine-tune the GIT model for a VQA task on TextVQA (and later VQAv2) to understand how the model works and exactly what inputs it expects.
I have read the paper introducing GIT, and I understand that the model on Hugging Face might differ since the model card was not released by the authors of the paper, but I still had some questions about the model on Hugging Face, since I didn't succeed at fine-tuning GIT on TextVQA:
Does it mean that during training I should give as `input_ids` the tokenized question+answer concatenated, and as `labels` only the answer, or is there something else to do so that the LM loss is only applied on the answer and the `[EOS]` tokens? I was not really sure how to interpret this sentence, but I tried with different inputs and none of what I tried worked:

- `input_ids` = tokenized questions and `labels` = tokenized answers, but the model wouldn't generate answers
- `input_ids` = tokenized and concatenated question+answer and `labels` = tokenized answers, but the generated answers were wrong and quite random (I tried with 50, 100, 200 epochs). Since it's a causal model (and it is mentioned in the paper), I guess this should have been the best solution

The loss is decreasing in both cases, so I'm not sure that I'm giving the right inputs to the model, as I can't obtain correct answers like `microsoft/git-base-textvqa` does.
To add a bit more context: for my training and preprocessing of the data, I took inspiration from your tutorial on how to fine-tune GIT for image captioning and tried to adapt it for a VQA task (thank you for all the tutorials!!!), so here is a snippet of my code:
Pre-processing
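A minimal sketch of preprocessing along these lines (illustrative rather than a definitive implementation; it assumes (image, question, answer) triples and the `microsoft/git-base` processor, and the class name `VQADataset` is hypothetical):

```python
import torch
from torch.utils.data import Dataset

class VQADataset(Dataset):
    """Hypothetical dataset over (image, question, answer) triples."""

    def __init__(self, examples, processor):
        self.examples = examples
        self.processor = processor

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        image, question, answer = self.examples[idx]

        # Image -> pixel_values via the GIT processor
        pixel_values = self.processor(images=image, return_tensors="pt").pixel_values[0]

        # question + answer concatenated as one sequence ("special caption")
        question_ids = self.processor.tokenizer(question, add_special_tokens=False).input_ids
        answer_ids = self.processor.tokenizer(answer, add_special_tokens=False).input_ids
        input_ids = (
            [self.processor.tokenizer.cls_token_id]
            + question_ids
            + answer_ids
            + [self.processor.tokenizer.sep_token_id]
        )

        # labels = copy of input_ids, with the prompt ([CLS] + question)
        # masked out with -100 so it incurs no loss
        labels = list(input_ids)
        prompt_length = 1 + len(question_ids)
        labels[:prompt_length] = [-100] * prompt_length

        return {
            "pixel_values": pixel_values,
            "input_ids": torch.tensor(input_ids),
            "attention_mask": torch.ones(len(input_ids), dtype=torch.long),
            "labels": torch.tensor(labels),
        }
```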
Model
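And a corresponding sketch of the training side (again illustrative; it assumes the `VQADataset` above, a placeholder `train_examples` list, and a collate function that pads `input_ids`/`attention_mask` and sets padded `labels` positions to -100 so padding is ignored by the loss as well):

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/git-base")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def collate_fn(batch):
    # Pad variable-length sequences; padded label positions get -100
    pad_id = processor.tokenizer.pad_token_id
    return {
        "pixel_values": torch.stack([ex["pixel_values"] for ex in batch]),
        "input_ids": pad_sequence([ex["input_ids"] for ex in batch],
                                  batch_first=True, padding_value=pad_id),
        "attention_mask": pad_sequence([ex["attention_mask"] for ex in batch],
                                       batch_first=True, padding_value=0),
        "labels": pad_sequence([ex["labels"] for ex in batch],
                               batch_first=True, padding_value=-100),
    }

# train_examples is a placeholder list of (image, question, answer) triples
dataloader = DataLoader(VQADataset(train_examples, processor),
                        batch_size=8, collate_fn=collate_fn)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
num_epochs = 50  # placeholder value

model.train()
for epoch in range(num_epochs):
    for batch in dataloader:
        outputs = model(
            pixel_values=batch["pixel_values"].to(device),
            input_ids=batch["input_ids"].to(device),
            attention_mask=batch["attention_mask"].to(device),
            labels=batch["labels"].to(device),
        )
        # the LM loss is only computed where labels != -100,
        # i.e. on the answer and EOS tokens
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```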
Thank you in advance and have a good day!