arthur19312 opened 6 months ago
I am having similar doubts.
When extracting text and audio embeddings, I can easily perform cosine similarity to find closely related pairs, and retrieve audio from text inputs and vice-versa.
However, I would like to know if there is a way to decode the embeddings back into text. Decoding them into audio seems manageable using AudioLDM.
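For concreteness, here is a minimal sketch of the cosine-similarity retrieval described above, assuming the laion_clap package and its get_audio_embedding_from_filelist / get_text_embedding helpers; the file names and captions below are placeholders, not files from this repo.

```python
import numpy as np
import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # downloads/loads the default pretrained checkpoint

audio_files = ["dog_bark.wav", "piano.wav"]            # placeholder paths
texts = ["a dog barking", "someone playing the piano"]  # placeholder captions

# Both calls return numpy arrays of shape (N, 512) for these checkpoints
audio_emb = model.get_audio_embedding_from_filelist(x=audio_files, use_tensor=False)
text_emb = model.get_text_embedding(texts, use_tensor=False)

# Cosine similarity = dot product of unit-normalised embeddings
audio_emb = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
sim = text_emb @ audio_emb.T  # sim[i, j]: how well text i matches audio j

# Text -> audio retrieval: best-matching clip for each caption
best_audio = sim.argmax(axis=1)
for i, t in enumerate(texts):
    print(f"{t!r} -> {audio_files[best_audio[i]]}")
```

Taking the argmax over the other axis gives audio -> text retrieval in the same way.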
@cvillela is there a way to decode CLAP embeddings to audio using AudioLDM?
Once I make the analogy to CLIP, I know how to use CLAP. My mind was stuck before ><. Thanks for your hints! Now we know AudioLDM can turn text into audio; are there any tools that work like CLIP Interrogator to turn audio into text?
@arthur19312 The following CLAP implementation also supports a model for audio captioning (not yet tested):
https://arxiv.org/abs/2309.05767 https://github.com/microsoft/CLAP
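Untested, but based on that repo's description the captioning path presumably looks roughly like this; the msclap package name, the "clapcap" variant, and the audio file here are assumptions on my part, so treat it as a sketch rather than a verified recipe.

```python
from msclap import CLAP

# Captioning variant of the Microsoft CLAP models (assumed version string)
clap_model = CLAP(version="clapcap", use_cuda=False)

# Placeholder file; generate_caption returns one caption per input file
captions = clap_model.generate_caption(["some_audio.wav"])
print(captions[0])
```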
I'm new here. I ran the method get_audio_embedding_from_filelist with the model music_audioset_epoch_15_esc_90.14.pt and got the audio embeddings. I roughly understand that they represent features of the input audio somehow, but I don't know how to use them. Could someone tell me what the audio embedding I get (an array of floats) actually is? Is this audio embedding compatible with other models? And how should I use it?
(PS: I'm really interested in this work, but it seems I lack some of the necessary background knowledge, so it would be great if someone could recommend some relevant materials to get me into the field. Thank you so much ❤)
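Not an authoritative answer, but one common way to use those float vectors is zero-shot tagging: embed a few candidate text labels with the same model and rank them by cosine similarity against the audio embedding. A rough sketch assuming the laion_clap package (the labels and song.wav are just examples):

```python
import numpy as np
import laion_clap

# The LAION README pairs this music checkpoint with the HTSAT-base audio encoder
model = laion_clap.CLAP_Module(enable_fusion=False, amodel="HTSAT-base")
model.load_ckpt("music_audioset_epoch_15_esc_90.14.pt")

audio_emb = model.get_audio_embedding_from_filelist(x=["song.wav"], use_tensor=False)

labels = ["rock music", "classical piano", "speech", "dog barking"]  # example labels
text_emb = model.get_text_embedding(labels, use_tensor=False)

# Each embedding is a 512-d float vector; cosine similarity ranks the labels
audio_emb = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
scores = (audio_emb @ text_emb.T)[0]
print(labels[int(scores.argmax())])
```

As far as I understand, the embedding is just a point in that checkpoint's shared audio-text space, so it is only directly comparable to embeddings produced by the same (or a compatibly trained) CLAP model, not to embeddings from unrelated models.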