acambray / GroundeR-PyTorch

This is an implementation of "Grounding of Textual Phrases in Images by Reconstruction" in PyTorch
MIT License

Have you got the unsupervised performance reported in the paper? Thanks #2

Open youngfly11 opened 4 years ago

acambray commented 4 years ago

To answer your question, I will copy and paste some text from the report of this project:

Weak Supervision (Only Reconstruction Loss)

Firstly, we look at the "unsupervised" model, i.e. where no region-phrase correspondence ground truth is used during training and learning is guided only by the reconstruction loss signal. Figure 4 shows the loss profile and Figure 5 shows the accuracy profile during training. It can be seen that the validation loss decays until approximately 125 epochs and then starts increasing, which may indicate overfitting to the training set. On the other hand, validation accuracy reaches a maximum at about 350 epochs, with a detection rate of roughly 29%. The test detection rate of the best validation model was 28.1%. We found that the unsupervised version of the model often focuses its attention on blank regions, i.e. padding regions initialised with zeros. This effectively means that it is learning to ignore the visual input and increase its reliance on the language model and teacher-forcing inputs. We consider this learning failure mode significant enough to warrant further investigation; it is not mentioned in the original paper.
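
For what it is worth, this failure mode is easy to measure with a small diagnostic along the following lines. This is only a sketch with assumed names (`attn_weights` and `proposal_mask` are not variables from this repo): it computes how much attention mass lands on the zero-padded proposals.

```python
import torch

def padding_attention_fraction(attn_weights, proposal_mask):
    """Average fraction of attention mass falling on padded (zero) proposals.

    attn_weights:  (batch, num_proposals) softmax attention over region proposals
    proposal_mask: (batch, num_proposals) True for real proposals,
                   False for zero-padded ones
    """
    pad_mass = (attn_weights * (~proposal_mask).float()).sum(dim=1)
    return pad_mass.mean().item()  # values near 1.0 mean attention sits mostly on padding
```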

youngfly11 commented 4 years ago

Thanks for your reply. Where can I find your report? I want to take a detailed look.

bobwan1995 commented 4 years ago

Thank you for the nice work! I'm also interested in your findings. If it works as you said and always attends to blank regions, how can it reach a performance of ~30%? In other words, in which cases can this model work? Looking forward to your reply!

acambray commented 4 years ago

I am not able to share my full report right now, sorry about that.

If it works as you said and always attends to blank regions, how can it reach a performance of ~30%?

Essentially, when training with teacher forcing in the decoder, the model learns the underlying conditional distribution of consecutive words when reconstructing the phrase. That is, it learns that phrases most often start with 'a', that 'a' is most often followed by a word like 'man', that 'green' is often followed by 'field', and so on, because those transitions are very likely. That is not to say that it fully ignores the visual input (which would result in visual performance equal to chance), but the model ends up with sub-optimal visual localisation because it has found a way to optimise the objective without relying much on the visual patches, exploiting the underlying text distribution instead, i.e. it largely ignores visual context when reconstructing phrases. At least that is my reasoning to explain the results seen.
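
To make the mechanism concrete, here is a minimal sketch of a teacher-forced reconstruction decoder. This is not the code of this repo; the class, the variable names such as `phrase_tokens` and `visual_context`, and the layer sizes are all assumptions for illustration. Because the ground-truth previous token is always fed in at each step, the decoder can lower the reconstruction loss by modelling word-to-word statistics alone, leaning only weakly on the attended visual feature.

```python
import torch
import torch.nn as nn

class ReconstructionDecoder(nn.Module):
    """Toy phrase-reconstruction decoder trained with teacher forcing."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, visual_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTMCell(embed_dim + visual_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, phrase_tokens, visual_context):
        # phrase_tokens:  (batch, T) ground-truth phrase token ids
        # visual_context: (batch, visual_dim) attended visual feature
        batch, T = phrase_tokens.shape
        h = visual_context.new_zeros(batch, self.rnn.hidden_size)
        c = torch.zeros_like(h)
        logits = []
        for t in range(T - 1):
            # Teacher forcing: condition on the ground-truth token at step t,
            # not on the model's own previous prediction.
            prev = self.embed(phrase_tokens[:, t])
            h, c = self.rnn(torch.cat([prev, visual_context], dim=1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # predictions for tokens 1..T-1
```

Greedy or beam-search decoding during reconstruction would instead feed back the model's own predictions, which is why I suggested it below as a possible mitigation.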

Back then, I listed this as an item of further work:

Investigate the major failure mode in the weakly supervised mode. The attention module learns to focus on padding regions (initialised as zeros), which effectively reduces the amount of visual context it relies on in favour of the language model learnt via teacher forcing. Implementing greedy decoding or beam-search decoding when reconstructing may alleviate this issue due to an associated increase in the reliance of the entire decoded sequence on the visual context.
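
As a rough illustration of what masking the padding regions could look like, here is a sketch (hypothetical names only; `attn_logits` and `proposal_mask` are not the repo's variables): the attention logits of zero-padded proposals are set to minus infinity before the softmax, so no attention mass can leak onto them.

```python
import torch

def masked_attention(attn_logits, proposal_mask):
    """Softmax over region proposals, with padded proposals masked out.

    attn_logits:   (batch, num_proposals) raw attention scores
    proposal_mask: (batch, num_proposals) True for real proposals,
                   False for zero-padded ones
    """
    # Padded proposals get -inf so they receive exactly zero attention.
    attn_logits = attn_logits.masked_fill(~proposal_mask, float('-inf'))
    return torch.softmax(attn_logits, dim=-1)
```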

I am not sure in which cases the model does perform well; I assume they would be the 'easiest' and most commonly repeated entities like humans, vehicles, etc. By easiest I mean entities with consistent appearance, or entities which stand out, i.e. whose visual appearance is consistently easy to discriminate from others.

I am happy to be corrected if my reasoning has some flaws.

bobwan1995 commented 4 years ago

I am not sure in which cases the model does perform well; I assume they would be the 'easiest' and most commonly repeated entities like humans, vehicles, etc. By easiest I mean entities with consistent appearance, or entities which stand out, i.e. whose visual appearance is consistently easy to discriminate from others.

Yes, I have the same observation. I found that the most commonly repeated entities, like humans and vehicles, are easy to recall. Thank you for the discussion!

xugy16 commented 3 years ago

I really appreciate your elegant code. I only have one question: could we mask out the blank (padding) regions in the implementation?