facebookresearch / AVID-CMA

Audio Visual Instance Discrimination with Cross-Modal Agreement

Using Cross-AVID for Audio-Video Synchronization #7

Open Mayur28 opened 2 years ago

Mayur28 commented 2 years ago

Hi,

The work presented in this paper is fascinating and I thank you for releasing the code as well.

I have read the paper several times and have gone through the code as well. I wanted to find out whether Cross-AVID can be used for audio-visual lip synchronization, and if so, how. By audio-visual lip synchronization, I mean that the model encourages the video and audio embeddings to be close in the embedding space when the audio is in sync with the mouth movements in the video, and far apart when it is not.

From my understanding, this was not the focus of the paper or the code, since the model is instead encouraged to match the corresponding memory features from the memory bank (this is also what is defined as the 'target' in the paper).

Is my understanding correct that, in this portion of the code and this code portion, instead of working with positive memories, we would instead use the embeddings computed from here?
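To make the question concrete, here is a minimal sketch of what I have in mind: a symmetric InfoNCE-style loss computed directly on the in-batch embeddings rather than on memory-bank features, treating aligned (in-sync) audio/video pairs as positives and all other pairings in the batch as out-of-sync negatives. The function name `sync_nce_loss` and the `temperature` value are my own placeholders, not part of the AVID-CMA codebase:

```python
import torch
import torch.nn.functional as F

def sync_nce_loss(video_emb, audio_emb, temperature=0.07):
    """Hypothetical sync loss: row i of video_emb and audio_emb are
    assumed to come from the same (in-sync) clip."""
    v = F.normalize(video_emb, dim=1)
    a = F.normalize(audio_emb, dim=1)
    # (B, B) cross-modal similarity matrix
    logits = v @ a.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    # Diagonal entries are the in-sync (positive) pairs; off-diagonal
    # entries act as out-of-sync negatives. Symmetrize over both modalities.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Would replacing the memory-bank targets with something along these lines be the right direction, or is there a reason the memory features are essential here?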

If you could advise on how Cross-AVID can be adapted to perform audio-visual lip synchronization, and whether my understanding is correct, it would be highly appreciated.

Thanks!