atosystem / SpeechCLIP

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model, Accepted to IEEE SLT 2022
https://atosystem.github.io/blogs/speechclip
BSD 3-Clause "New" or "Revised" License

Simple Embeddings #2

Closed corranmac closed 1 year ago

corranmac commented 1 year ago

Hi,

Please could you provide a simple way to load a model and test a single audio clip to produce an embedding?

Thank you very much.

atosystem commented 1 year ago

@corranmac Got it! I have updated the README; see example.py.

corranmac commented 1 year ago

Thanks for such a quick response!

I was wondering if I'm able to transform these output embeddings into the same shape as CLIP's, for use in speech-image retrieval and also in image generation models trained on CLIP embeddings? I can't seem to find a separate class for encoding, e.g. model.encode() like CLIP has.

Thanks

atosystem commented 1 year ago

@corranmac Yeah, you can use the semantic embedding of the speech to calculate similarity with the image embeddings for speech-image retrieval. In fact, this is how we do it in our paper. I have added a function to the model class (kwClip.py) for extracting the semantic embedding of a speech input: https://github.com/atosystem/SpeechCLIP/blob/e2a572d905e8b9d9365ccb8a44dca5c13a60744a/avssl/model/kwClip.py#L1299-L1315
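The retrieval described above amounts to ranking images by cosine similarity between one speech embedding and a batch of CLIP image embeddings. Here is a minimal sketch of that step, using random vectors in place of real SpeechCLIP/CLIP outputs; the 512-dim size and the function/variable names are illustrative assumptions, not the repository's API.

```python
import numpy as np

def retrieve_images(speech_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Rank images by cosine similarity to a single speech embedding.

    speech_emb: (d,) semantic speech embedding (e.g. from SpeechCLIP).
    image_embs: (n, d) CLIP image embeddings.
    Returns image indices sorted from most to least similar.
    """
    # L2-normalize both sides so a dot product equals cosine similarity.
    s = speech_emb / np.linalg.norm(speech_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ s            # (n,) cosine similarities
    return np.argsort(-sims)   # indices in descending similarity order

# Toy demo: 5 random "image" embeddings; the speech embedding is a
# slightly noisy copy of image 3, so image 3 should rank first.
rng = np.random.default_rng(0)
image_embs = rng.standard_normal((5, 512))
speech_emb = image_embs[3] + 0.01 * rng.standard_normal(512)
ranking = retrieve_images(speech_emb, image_embs)
print(ranking[0])  # → 3
```

Image-to-speech retrieval is the same computation with the roles swapped: normalize a batch of speech embeddings and rank them against one image embedding.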

atosystem commented 1 year ago

@corranmac If there is no further question, I will close this issue.