audio-captioning / dcase-2020-baseline

Audio captioning baseline system for DCASE 2020 challenge.
http://dcase.community/challenge2020/task-automatic-audio-captioning

About the baseline net #11

Closed LittleFlyingSheep closed 3 years ago

LittleFlyingSheep commented 3 years ago

In the baseline, 'baseline_dcase.py' contains the line `h_encoder: Tensor = self.encoder(x)[:, -1, :].unsqueeze(1).expand(-1, self.max_out_t_steps, -1)`. Why does the baseline keep only the last time step of the encoder output?

dr-costas commented 3 years ago

Hi,

Keeping the summary of the input sequence is just one way of doing sequence-to-sequence processing.

LittleFlyingSheep commented 3 years ago

Thanks for your reply. Can I understand it as follows? The encoder (GRUs) processes the input sequence of audio features and outputs a sequence of hidden features. Then we select the last step of the output sequence as the summary of the whole sequence, and repeat it max_out_t_steps times.

dr-costas commented 3 years ago

The expansion to max_out_t_steps is just a way to re-use the summary at every time step of the decoder. :)
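The shape mechanics of that line can be sketched in isolation. The sketch below uses a random tensor in place of the real GRU encoder output, and the concrete sizes (batch 4, 100 input time steps, 256 features, max_out_t_steps = 22) are illustrative assumptions, not values from the baseline:

```python
import torch

batch, in_t_steps, feat = 4, 100, 256
max_out_t_steps = 22  # illustrative value, not from the baseline config

# Stand-in for self.encoder(x): (batch, time, features)
enc_out = torch.randn(batch, in_t_steps, feat)

# Take the last time step as the sequence summary: (batch, features)
summary = enc_out[:, -1, :]

# Re-insert a time axis and expand it so the same summary is
# presented to the decoder at every output time step.
# expand() creates a broadcast view, so no data is copied.
h_encoder = summary.unsqueeze(1).expand(-1, max_out_t_steps, -1)

print(h_encoder.shape)  # torch.Size([4, 22, 256])
```

Every row along the new time axis is the same vector, which matches the maintainer's description: one summary, reused at each decoder step.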

dr-costas commented 3 years ago

Hi,

I'm closing this issue. If you have any further questions, please feel free to create another issue.