atosystem / SpeechCLIP

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model, Accepted to IEEE SLT 2022
https://atosystem.github.io/blogs/speechclip
BSD 3-Clause "New" or "Revised" License

About the speech-text retrieval implementation #4

Closed: xiaoyaoxiaoxian closed this issue 8 months ago

xiaoyaoxiaoxian commented 1 year ago

Hi, I'm trying to reproduce the results in your paper. However, I could not find the implementation for speech-to-text and text-to-speech retrieval. Could you share this part of the code? I have also tried to implement it myself, but I ran into some problems:

  1. At inference time, forward_text calls original2Reduced twice: once in forward_text itself and again in prep_text.
  2. When I try the text prompt 'turn on', I get a KeyError. Does that mean the token is not in the reduced embedding? How can I work around this?
atosystem commented 1 year ago

@xiaoyaoxiaoxian Sorry for the late reply. Unfortunately I no longer have the speech-text retrieval code on my computer, but you should not need to change much. Get rid of the original2Reduced mapping: for the speech branch, load the pretrained checkpoint; for the text branch, just use the original CLIP text encoder with the original tokenizer and token embedding.
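
For reference, here is a minimal sketch of what that retrieval setup could look like. It is not the original code from the paper: it assumes OpenAI's `clip` package for the text branch, and the speech embeddings are stood in by random tensors, since in practice they would come from the pretrained SpeechCLIP speech encoder loaded from the released checkpoint.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text branch: the original CLIP text encoder with its original tokenizer and
# token embedding. With no original2Reduced mapping, a prompt like "turn on"
# tokenizes normally and cannot raise a KeyError.
clip_model, _ = clip.load("ViT-B/32", device=device)
texts = ["turn on", "a dog runs on the grass", "someone is playing piano"]
text_tokens = clip.tokenize(texts).to(device)
with torch.no_grad():
    text_feats = clip_model.encode_text(text_tokens).float()  # (num_texts, dim)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# Speech branch: in practice these embeddings come from the pretrained
# SpeechCLIP speech encoder; random tensors stand in here so the sketch
# runs on its own.
num_clips, dim = 5, text_feats.shape[-1]
speech_feats = torch.randn(num_clips, dim, device=device)
speech_feats = speech_feats / speech_feats.norm(dim=-1, keepdim=True)

# Retrieval: cosine similarity in the shared CLIP embedding space.
similarity = speech_feats @ text_feats.t()                        # (num_clips, num_texts)
speech_to_text = similarity.argsort(dim=-1, descending=True)      # rank texts per clip
text_to_speech = similarity.t().argsort(dim=-1, descending=True)  # rank clips per text
print(speech_to_text)
print(text_to_speech)
```

With real speech embeddings in place of the random tensors, the argsort indices give the ranked candidates for speech-to-text and text-to-speech retrieval.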