LindgeW opened this issue 3 weeks ago
Hi @LindgeW,
The original paper proposed "... We first create modality specific temporary bottleneck fusion tokens, which are updated separately and simultaneously with audio and visual information (Equation 8). The final fusion tokens from each cross-modal update are then averaged in (Equation 9) ..." In my implementation of bottleneck fusion, I did not strictly adhere to this for the sake of simplicity.
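For reference, this is a minimal sketch of how I read Equations 8 and 9: each modality updates its own temporary copy of the bottleneck tokens through its own encoder layer, and the copies are then averaged. The dims, heads, and the plain `nn.TransformerEncoderLayer` are placeholders I picked for illustration, not the modules used in this repo.

```python
import torch
import torch.nn as nn

class PaperStyleBottleneckFusion(nn.Module):
    """Sketch of Eq. 8/9 as I read them: each modality updates its own
    temporary copy of the bottleneck tokens, and the copies are averaged."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Placeholder modality-specific layers; the actual model uses its own blocks.
        self.audio_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, audio, video, bottleneck):
        # audio: (B, Na, D), video: (B, Nv, D), bottleneck: (B, Nb, D)
        nb = bottleneck.size(1)

        # Eq. 8: each modality attends over [own tokens || its bottleneck copy].
        a_out = self.audio_layer(torch.cat([audio, bottleneck], dim=1))
        v_out = self.video_layer(torch.cat([video, bottleneck], dim=1))
        audio, bn_audio = a_out[:, :-nb], a_out[:, -nb:]
        video, bn_video = v_out[:, :-nb], v_out[:, -nb:]

        # Eq. 9: average the temporary bottleneck copies.
        bottleneck = 0.5 * (bn_audio + bn_video)
        return audio, video, bottleneck
```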
With the cross-attention steps (AV -> latents) followed by (latents -> AV), I didn't notice any substantial difference in performance. However, if this doesn't yield the desired performance in your experiments, please feel free to change it back to the original formulation proposed by the authors of the paper.
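Roughly, the simplified flow is along these lines (a sketch only, not the actual code in this repo; the module names, residual connections, and dimensions are my own placeholders):

```python
import torch
import torch.nn as nn

class SimpleCrossAttentionFusion(nn.Module):
    """Rough sketch of the simplified flow: concatenate the audio and video
    tokens, let shared latents attend to them (AV -> latents), then let the
    tokens attend back to the updated latents (latents -> AV)."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.av_to_latents = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.latents_to_av = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, video, latents):
        # audio: (B, Na, D), video: (B, Nv, D), latents: (B, Nl, D)
        av = torch.cat([audio, video], dim=1)

        # AV -> latents: latents query the concatenated audio-visual tokens.
        latents = latents + self.av_to_latents(latents, av, av, need_weights=False)[0]

        # latents -> AV: the audio-visual tokens query the fused latents.
        av = av + self.latents_to_av(av, latents, latents, need_weights=False)[0]

        # Residual connections above are my own choice for the sketch.
        audio, video = av[:, :audio.size(1)], av[:, audio.size(1):]
        return audio, video, latents
```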
Cheers!
Hi, thanks for your great work.
It seems that your implementation of bottleneck fusion for audio-visual data differs from the idea in the paper (https://github.com/google-research/scenic/tree/main/scenic/projects/mbt), especially the step that concatenates all the tokens. As I understand the paper, each modality performs cross-attention with the bottleneck latents separately, rather than first concatenating the tokens of both modalities.