aimagelab / meshed-memory-transformer

Meshed-Memory Transformer for Image Captioning. CVPR 2020
BSD 3-Clause "New" or "Revised" License

Q/A visual for coding #15

Open TrungThanhTran opened 4 years ago

TrungThanhTran commented 4 years ago

Hi @baraldilorenzo,

I'm trying to improve the speed of beam_search. While doing so, I came across this call in the iter function of beam_search.py: visual = self._expand_visual(visual, cur_beam_size, selected_beam)

Could you tell me what this function does?

T.T.T

svp19 commented 4 years ago

@TranTony I found that the function expands visual (i.e., repeats the tensor beam_size times along the batch dimension) at the first step, while for all subsequent steps the output of the call is identical to the visual fed in.

Example: if I feed in visual as a FloatTensor of size (4, 50, 2048), i.e. (b_s, seq_len, d_input), with beam_size=5, then self._expand_visual returns a FloatTensor of size (20, 50, 2048), i.e. (b_s * beam_size, seq_len, d_input), at the first step of beam search.
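A rough sketch of that first-step expansion (my own reconstruction with the sizes above, not necessarily the repo's exact implementation):

import torch

# Repeat each image's features beam_size times along the batch dimension.
b_s, seq_len, d_input, beam_size = 4, 50, 2048, 5
visual = torch.randn(b_s, seq_len, d_input)              # (4, 50, 2048)
expanded = (visual.unsqueeze(1)                          # (4, 1, 50, 2048)
                  .expand(b_s, beam_size, seq_len, d_input)
                  .reshape(b_s * beam_size, seq_len, d_input))
print(expanded.shape)                                    # torch.Size([20, 50, 2048])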

For subsequent steps of beam search, a visual of shape (20, 50, 2048) is fed to self._expand_visual and, as expected, the output tensor is identical to the input:

# inside iter(), at any step after the first
old_visual = visual
visual = self._expand_visual(visual, cur_beam_size, selected_beam)
print(torch.equal(old_visual, visual))
>> True
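One plausible explanation for this no-op (my assumption, not confirmed from the code): after the first step the tensor is constant across the beam dimension, since every beam of an image shares the same visual features, so gathering the beams in any order returns the same values:

import torch

# Assumed reasoning: features are identical across beams of the same image,
# so reordering beams with selected_beam cannot change the tensor.
b_s, beam_size, seq_len, d_input = 4, 5, 10, 8   # small sizes for the demo
visual = torch.randn(b_s, 1, seq_len, d_input).expand(b_s, beam_size, seq_len, d_input)
selected_beam = torch.randint(0, beam_size, (b_s, beam_size))  # arbitrary beam reordering
idx = selected_beam.view(b_s, beam_size, 1, 1).expand(b_s, beam_size, seq_len, d_input)
print(torch.equal(visual.gather(1, idx), visual))  # True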

Did I miss anything? Also, how did you intend to speed up beam search?

TrungThanhTran commented 4 years ago

I reduced beam_size to 1 or 2 and found that it achieves roughly the same result. Note, though, that I applied the model to an auto-annotation problem that generates about 50 words at a time, so you may not need to reduce it for your use case. I also reduced the number of connections and the number of encoder and decoder layers. About the visual tensor: yes, I see the same outcome as you do.
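For reference, a rough sketch of decoding with a smaller beam, assuming the beam_search call signature used in the repo's test.py (model, images, and text_field are placeholders for your own loaded model and data):

# Hypothetical usage; assumed signature:
# beam_search(visual, max_len, eos_idx, beam_size, out_size)
beam_size = 2  # reduced from 5 for faster decoding
out, log_probs = model.beam_search(images, 20, text_field.vocab.stoi['<eos>'],
                                   beam_size, out_size=1)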