Hello, I wonder which dataset you used to train CLAP (especially for music).
The reason I ask is that audio embeddings computed from audio synthesized from MIDI are not closely aligned with the text embeddings (MusicCaps, AudioStock, LP-MusicCaps, AudioSet) when I plot samples in t-SNE space.
Also, GTZAN embeddings show a similar pattern.
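For reference, this is roughly the comparison I'm drawing (a minimal sketch; the `laion/clap-htsat-unfused` checkpoint, file paths, and captions below are placeholders on my side, not necessarily what your repo uses):

```python
# Minimal sketch of the embedding comparison. Checkpoint name, file paths,
# and captions are placeholders (assumptions), not from this repo.
import numpy as np
import librosa
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# Audio rendered from MIDI (placeholder paths) and captions, e.g. from MusicCaps.
audio_paths = ["midi_render_001.wav", "midi_render_002.wav"]
captions = [
    "a gentle piano melody with soft strings",
    "an upbeat rock track with electric guitar",
]

# CLAP's feature extractor expects 48 kHz input.
audios = [librosa.load(p, sr=48000)[0] for p in audio_paths]

audio_inputs = processor(audios=audios, sampling_rate=48000, return_tensors="pt")
text_inputs = processor(text=captions, padding=True, return_tensors="pt")

audio_emb = model.get_audio_features(**audio_inputs).detach().numpy()
text_emb = model.get_text_features(**text_inputs).detach().numpy()

# Joint t-SNE over both modalities to see whether audio and text
# embeddings land in the same region of the projected space.
joint = np.concatenate([audio_emb, text_emb], axis=0)
coords = TSNE(n_components=2, perplexity=min(5, len(joint) - 1)).fit_transform(joint)

n_audio = len(audio_emb)
plt.scatter(coords[:n_audio, 0], coords[:n_audio, 1], label="audio (MIDI-synthesized)")
plt.scatter(coords[n_audio:, 0], coords[n_audio:, 1], label="text (captions)")
plt.legend()
plt.show()
```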
Alternatively, could you share some example captions used for training?
Regards.