DeepRNN / image_captioning

Tensorflow implementation of "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention"

Strange visual attention results #32

Open · mojesty opened this issue 6 years ago

mojesty commented 6 years ago

Hello @DeepRNN! I took a look at the attentions that the model generates in test mode. I did the following: in base_model.py:200 I changed the code as follows

                # also fetch self.attentions so the per-step attention weights
                # can be inspected alongside the original outputs
                memory, output, scores, attentions = sess.run(
                    [self.memory, self.output, self.probs, self.attentions],
                    feed_dict = {self.contexts: contexts,
                                 self.last_word: last_word,
                                 self.last_memory: last_memory,
                                 self.last_output: last_output})

So after that, every attentions array has shape (batch_size, 196, beam_size); for simplicity I set beam_size=1 when testing. Next, I simply stack all the attentions into one numpy array and visualize its contents (a rough sketch of this step is below). I found two concerns:

  1. Attention maps for different tokens vary negligibly (for 1.jpg the maximum difference between the 1st and 2nd tokens is ~1e-9).
  2. The maps themselves look rather strange. For the test image with a bus: bus_attn

However, the caption of the image is both grammatically and semantically correct. I would like to discuss these results.
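
A minimal sketch of the stack-and-visualize step, assuming batch_size = beam_size = 1, that the 196 regions form a 14x14 grid, and using random stand-in data in place of the real sess.run outputs:

    import numpy as np
    import matplotlib.pyplot as plt

    # Stand-in for the per-token attention arrays returned by sess.run above;
    # each is assumed to have shape (batch_size, 196, beam_size) with
    # batch_size = beam_size = 1.
    attentions = [np.random.rand(1, 196, 1) for _ in range(5)]

    # Stack into one array of 14x14 maps, one per generated token.
    maps = np.stack([a[0, :, 0].reshape(14, 14) for a in attentions])

    # Maximum difference between the maps of the 1st and 2nd tokens.
    print(np.abs(maps[0] - maps[1]).max())

    # One panel per token; with the real data these panels look nearly identical.
    fig, axes = plt.subplots(1, len(maps))
    for ax, m in zip(axes, maps):
        ax.imshow(m, cmap='viridis')
        ax.axis('off')
    plt.show()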

JiayunLi commented 6 years ago

Yeah, I encountered the same problem. I found that self.attentions is only defined in train mode, so I also added it in test mode. However, the difference between the attention maps is still negligible.

mojesty commented 6 years ago

Could this be caused by some mistake that makes the attention shared across all timesteps? A ~1e-9 difference would be explained by that. If so, how can it be fixed?

RoronoaZA commented 6 years ago

@JiayunLi I think self.attentions is only used for visualization.

JiayunLi commented 6 years ago

@mojesty I haven't figured it out.

JiayunLi commented 6 years ago

@RoronoaZA Yeah. Are you able to get reasonable attention maps for evaluation images?

minizon commented 5 years ago

Same problem

bright1993ff66 commented 5 years ago

Hi @mojesty, how long does it take to train one epoch? What machines did you use? Thank you!

chansongoal commented 4 years ago

Has anyone solved the problem? Can anyone help me? Thank you all.

Fanshia commented 4 years ago

The attention layer defined in model.py is:

  image context (49, 2048) -> fc1 -> hidden vector a (49,)
  word vector (num of LSTM units) -> fc2 -> hidden vector b (49,)
  attention = softmax(hidden vector a + hidden vector b)

I don't think the attention could work, because:

  1. hidden vector a is constant, because the image context never changes;
  2. hidden vector b is generated from the LSTM output, which shouldn't contain location information.

So in the end, fc2 generates a very small hidden vector b, and the attention mainly depends on hidden vector a. That's why the attention maps you observe are nearly identical across tokens.
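
A minimal numerical sketch of this argument (the shapes and magnitudes are illustrative assumptions, not values taken from the repository): if the per-step contribution b_t is tiny compared to the constant a, then softmax(a + b_t) is nearly the same at every step.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    rng = np.random.default_rng(0)

    # hidden vector a: computed once from the image context, identical at every step
    a = rng.normal(size=49)
    base = softmax(a)

    # hidden vector b_t: computed from the LSTM output at each step; assumed to have
    # much smaller magnitude than a, as argued above
    for t in range(3):
        b_t = 1e-3 * rng.normal(size=49)
        alpha_t = softmax(a + b_t)
        # differences are orders of magnitude smaller than the attention values themselves
        print(t, np.abs(alpha_t - base).max())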