LindgeW opened this issue 3 weeks ago
Hi @LindgeW,
The original paper proposed "... We first create modality specific temporary bottleneck fusion tokens, which are updated separately and simultaneously with audio and visual information (Equation 8). The final fusion tokens from each cross-modal update are then averaged in (Equation 9) ..." In my implementation of bottleneck fusion, I did not strictly adhere to this for the sake of simplicity.
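For reference, this is a minimal sketch of how I read Equations 8 and 9: each modality updates its own temporary copy of the bottleneck tokens through its own encoder layer, and the copies are then averaged. The dims, heads, and the plain `nn.TransformerEncoderLayer` are placeholders I picked for illustration, not the modules used in this repo.

```python
import torch
import torch.nn as nn

class PaperStyleBottleneckFusion(nn.Module):
    """Sketch of Eq. 8/9 as I read them: each modality updates its own
    temporary copy of the bottleneck tokens, and the copies are averaged."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Placeholder modality-specific layers; the actual model uses its own blocks.
        self.audio_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.video_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, audio, video, bottleneck):
        # audio: (B, Na, D), video: (B, Nv, D), bottleneck: (B, Nb, D)
        nb = bottleneck.size(1)

        # Eq. 8: each modality attends over [own tokens || its bottleneck copy].
        a_out = self.audio_layer(torch.cat([audio, bottleneck], dim=1))
        v_out = self.video_layer(torch.cat([video, bottleneck], dim=1))
        audio, bn_audio = a_out[:, :-nb], a_out[:, -nb:]
        video, bn_video = v_out[:, :-nb], v_out[:, -nb:]

        # Eq. 9: average the temporary bottleneck copies.
        bottleneck = 0.5 * (bn_audio + bn_video)
        return audio, video, bottleneck
```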
With the cross-attention steps (AV -> latents) followed by (latents -> AV), I didn't notice any substantial difference in performance. However, if this doesn't yield the desired performance in your experiments, please feel free to change it back to the original formulation proposed by the authors of the paper.
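Roughly, the simplified flow is along these lines (a sketch only, not the actual code in this repo; the module names, residual connections, and dimensions are my own placeholders):

```python
import torch
import torch.nn as nn

class SimpleCrossAttentionFusion(nn.Module):
    """Rough sketch of the simplified flow: concatenate the audio and video
    tokens, let shared latents attend to them (AV -> latents), then let the
    tokens attend back to the updated latents (latents -> AV)."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.av_to_latents = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.latents_to_av = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, video, latents):
        # audio: (B, Na, D), video: (B, Nv, D), latents: (B, Nl, D)
        av = torch.cat([audio, video], dim=1)

        # AV -> latents: latents query the concatenated audio-visual tokens.
        latents = latents + self.av_to_latents(latents, av, av, need_weights=False)[0]

        # latents -> AV: the audio-visual tokens query the fused latents.
        av = av + self.latents_to_av(av, latents, latents, need_weights=False)[0]

        # Residual connections above are my own choice for the sketch.
        audio, video = av[:, :audio.size(1)], av[:, audio.size(1):]
        return audio, video, latents
```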
Cheers!
Hi, thanks for your great work.
It seems that your implementation of bottleneck fusion for audio-visual data differs from the idea in the paper (https://github.com/google-research/scenic/tree/main/scenic/projects/mbt), especially the step that concatenates all the tokens. As I understand the paper, each modality performs cross-attention with the bottleneck latents separately, rather than first concatenating the tokens of both modalities.