LAION-AI / CLAP

Contrastive Language-Audio Pretraining
https://arxiv.org/abs/2211.06687
Creative Commons Zero v1.0 Universal

How to understand and use the audio embedding? #148

Open arthur19312 opened 6 months ago

arthur19312 commented 6 months ago

I'm new here. I ran the method `get_audio_embedding_from_filelist` with the model `music_audioset_epoch_15_esc_90.14.pt` and got audio embeddings like:

[[-4.639852792024612427e-02, -9.935184381902217865e-03, ...]]

I roughly understand that this somehow represents the features of the input audio, but I don't know how to use it. Could someone tell me what this audio embedding, given as a list of floats, actually is? Is it compatible with other models? And how should I use it?

(PS: I'm really interested in this work, but it seems I lack some of the necessary background, so it would be great if someone could recommend relevant materials to get me into the field. Thank you so much ❤)
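For context, the embedding is a fixed-length vector (512 floats per clip for the LAION checkpoints) in a joint audio-text space; its individual values are not interpretable on their own, only distances between vectors are. Below is a minimal sketch of how such an embedding is extracted with the `laion_clap` pip package; the encoder settings are an assumption based on the checkpoint name:

```python
# Minimal sketch, assuming the `laion_clap` pip package and a local
# copy of the checkpoint; the settings below follow the usual recipe
# for the music checkpoint (HTSAT-base encoder, no feature fusion).
import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=False, amodel='HTSAT-base')
model.load_ckpt('music_audioset_epoch_15_esc_90.14.pt')

# One row per input file: a NumPy array of shape (num_files, 512).
audio_embed = model.get_audio_embedding_from_filelist(
    x=['example.wav'], use_tensor=False
)
print(audio_embed.shape)  # e.g. (1, 512)
```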

cvillela commented 6 months ago

I am having similar doubts.

When extracting text and audio embeddings, I can easily perform cosine similarity to find closely related pairs, and retrieve audio from text inputs and vice-versa.
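For concreteness, here is a minimal sketch of that retrieval pattern, assuming a `model` loaded as in the snippet above (the file names and prompts are hypothetical):

```python
import numpy as np

audio_files = ['dog_bark.wav', 'piano.wav']             # hypothetical
texts = ['a dog barking', 'someone playing the piano']  # hypothetical

audio_embed = model.get_audio_embedding_from_filelist(x=audio_files,
                                                      use_tensor=False)
text_embed = model.get_text_embedding(texts)

# L2-normalize, then cosine similarity reduces to a dot product.
a = audio_embed / np.linalg.norm(audio_embed, axis=1, keepdims=True)
t = text_embed / np.linalg.norm(text_embed, axis=1, keepdims=True)
sim = t @ a.T  # shape (num_texts, num_audio_files)

# Best-matching audio file for each text query.
for text, row in zip(texts, sim):
    print(text, '->', audio_files[int(np.argmax(row))])
```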

However, I would like to know if there is a way to decode the embeddings into text. Decoding them into audio seems manageable using AudioLDM.

satvik-dixit commented 4 months ago

@cvillela is there a way to decode CLAP embeddings to audio using AudioLDM?

arthur19312 commented 1 month ago

Once I made the analogy to CLIP, I understood how to use CLAP; my mind was stuck before ><. Thanks for your hints! Now we know AudioLDM can turn text into audio, but is there any tool that works like CLIP Interrogator to turn audio into text?

waldleitner commented 1 month ago

@arthur19312 The following CLAP implementation also supports a model for audio captioning (not yet tested):

Paper: https://arxiv.org/abs/2309.05767
Code: https://github.com/microsoft/CLAP
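An untested sketch of how that captioning variant appears to be invoked via the `msclap` pip package, based on that repo's README (the file name is hypothetical):

```python
# Untested sketch of audio captioning with Microsoft's msclap package.
from msclap import CLAP

# version='clapcap' selects the audio-captioning variant.
clap_model = CLAP(version='clapcap', use_cuda=False)

captions = clap_model.generate_caption(['example.wav'])  # hypothetical file
print(captions[0])
```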