LAION-AI / CLAP

Contrastive Language-Audio Pretraining
https://arxiv.org/abs/2211.06687
Creative Commons Zero v1.0 Universal

How to understand and use the audio embedding? #148

Open arthur19312 opened 4 months ago

arthur19312 commented 4 months ago

I'm new here. I ran the method get_audio_embedding_from_filelist with the model music_audioset_epoch_15_esc_90.14.pt and got audio embeddings like this:

[[-4.639852792024612427e-02, -9.935184381902217865e-03, ...]]

I roughly understand that it represents the features of the input audio in some way, but I don't know how to use it.

At first I tried some methods to decode it into text so that I could get a rough understanding of it.

That is where I got confused. Could someone tell me what this audio embedding, which I get as an array of floats, actually is? Can it be used with other models? And how should I use it?

(PS: I'm really interested in this work, but it seems I lack some necessary background knowledge, so I would appreciate it if someone could recommend some relevant materials to get me into the field. Thank you so much ❤)

cvillela commented 4 months ago

I have similar questions.

When extracting text and audio embeddings, I can easily perform cosine similarity to find closely related pairs, and retrieve audio from text inputs and vice-versa.

However, I would like to know whether there is a way to decode the embeddings into text. Decoding them into audio seems manageable using AudioLDM.
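
For anyone following along, the cosine-similarity retrieval mentioned above can be sketched roughly as follows. This is only an illustration in plain Python: the helper names cosine_similarity and rank_by_similarity are hypothetical (they are not part of the CLAP codebase), and the short vectors are toy stand-ins for real text and audio embeddings, which are much longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional stand-ins (assumption: real embeddings would come from the
# model, e.g. one text embedding and several audio embeddings).
text_embedding = [1.0, 0.0, 0.0]
audio_embeddings = [
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.9, 0.1, 0.0],   # nearly parallel to the query
    [-1.0, 0.0, 0.0],  # opposite to the query
]

ranking = rank_by_similarity(text_embedding, audio_embeddings)
print(ranking)  # [1, 0, 2]
```

The same ranking works in the other direction (an audio embedding as the query against a list of text embeddings), since cosine similarity is symmetric. With NumPy arrays, the loop would usually be replaced by normalizing the rows and taking a matrix product.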

satvik-dixit commented 1 month ago

@cvillela Is there a way to decode CLAP embeddings to audio using AudioLDM?