LAION-AI / CLAP

Contrastive Language-Audio Pretraining
https://arxiv.org/abs/2211.06687
Creative Commons Zero v1.0 Universal

How to understand and use the audio embedding? #148

Open arthur19312 opened 6 months ago

arthur19312 commented 6 months ago

I'm new here. I ran the method `get_audio_embedding_from_filelist` with the model `music_audioset_epoch_15_esc_90.14.pt` and got audio embeddings like:

[[-4.639852792024612427e-02, -9.935184381902217865e-03, ...]]

I roughly understand that this somehow represents the features of the input audio, but I don't know how to use it. Could someone tell me what this audio embedding, given as a list of floats, actually is? Is it compatible with other models? And how should I use it?

(PS: I'm really interested in this work, but it seems I lack some of the necessary background, so it would be great if someone could recommend relevant materials to get me into the field. Thank you so much ❤)
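For context, the embedding is a fixed-length vector (512 floats per clip for the LAION checkpoints) in a joint audio-text space; its individual values are not interpretable on their own, only distances between vectors are. Below is a minimal sketch of how such an embedding is extracted with the `laion_clap` pip package; the encoder settings are an assumption based on the checkpoint name:

```python
# Minimal sketch, assuming the `laion_clap` pip package and a local
# copy of the checkpoint; the settings below follow the usual recipe
# for the music checkpoint (HTSAT-base encoder, no feature fusion).
import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=False, amodel='HTSAT-base')
model.load_ckpt('music_audioset_epoch_15_esc_90.14.pt')

# One row per input file: a NumPy array of shape (num_files, 512).
audio_embed = model.get_audio_embedding_from_filelist(
    x=['example.wav'], use_tensor=False
)
print(audio_embed.shape)  # e.g. (1, 512)
```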

cvillela commented 6 months ago

I am having similar doubts.

When extracting text and audio embeddings, I can easily perform cosine similarity to find closely related pairs, and retrieve audio from text inputs and vice-versa.
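For concreteness, here is a minimal sketch of that retrieval pattern, assuming a `model` loaded as in the snippet above (the file names and prompts are hypothetical):

```python
import numpy as np

audio_files = ['dog_bark.wav', 'piano.wav']             # hypothetical
texts = ['a dog barking', 'someone playing the piano']  # hypothetical

audio_embed = model.get_audio_embedding_from_filelist(x=audio_files,
                                                      use_tensor=False)
text_embed = model.get_text_embedding(texts)

# L2-normalize, then cosine similarity reduces to a dot product.
a = audio_embed / np.linalg.norm(audio_embed, axis=1, keepdims=True)
t = text_embed / np.linalg.norm(text_embed, axis=1, keepdims=True)
sim = t @ a.T  # shape (num_texts, num_audio_files)

# Best-matching audio file for each text query.
for text, row in zip(texts, sim):
    print(text, '->', audio_files[int(np.argmax(row))])
```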

However, I would like to know if there is a way to decode the embeddings into text. Decoding them into audio seems manageable using AudioLDM.

satvik-dixit commented 4 months ago

@cvillela is there a way to decode CLAP embeddings to audio using AudioLDM?

arthur19312 commented 1 month ago

Once I made the analogy to CLIP, I understood how to use CLAP; my mind was stuck before ><. Thanks for your hints! Now we know AudioLDM can turn text into audio, but is there any tool that works like CLIP Interrogator to turn audio into text?

waldleitner commented 1 month ago

@arthur19312 The following CLAP implementation also supports a model for audio captioning (not yet tested):

Paper: https://arxiv.org/abs/2309.05767
Code: https://github.com/microsoft/CLAP
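An untested sketch of how that captioning variant appears to be invoked via the `msclap` pip package, based on that repo's README (the file name is hypothetical):

```python
# Untested sketch of audio captioning with Microsoft's msclap package.
from msclap import CLAP

# version='clapcap' selects the audio-captioning variant.
clap_model = CLAP(version='clapcap', use_cuda=False)

captions = clap_model.generate_caption(['example.wav'])  # hypothetical file
print(captions[0])
```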