Open DuBose-Tuller opened 1 month ago
Has anyone used MusicGen to try to generate embeddings for audio/music datasets? Specifically the language model part, not just EnCodec. I have been trying to do this myself for a research project, and I am struggling to achieve any meaningful separation, even between dramatically different datasets.

Generally, causal (left-to-right, autoregressive) models don't make great embeddings, because the early tokens are missing most of the context due to the causal attention structure. Masked language models are better suited for embeddings. That's why many projects (including audiocraft) use T5 for text embeddings even though larger, newer (but autoregressive) models are available.

Perhaps MAGNeT would be a better fit for what you're trying to achieve, since it's non-autoregressive.
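To make the causal-mask point concrete: under a causal mask, token i only attends to tokens 0..i, so the first token's hidden state encodes essentially nothing about the sequence, while the last token (or a mean-pool over all positions) sees everything. A minimal numpy sketch of this — a toy single-head attention layer, not MusicGen's actual architecture, with `causal_self_attention` being a made-up illustrative function:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x):
    """Toy single-head self-attention with a causal (lower-triangular) mask.

    Token i can only attend to tokens 0..i, which is why first-token
    hidden states make poor whole-sequence embeddings in causal LMs.
    For illustration, x is used directly as queries, keys, and values.
    """
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # future positions
    scores[mask] = -np.inf                            # block attention to the future
    return softmax(scores, axis=-1) @ x

rng = np.random.default_rng(0)
x_in = rng.normal(size=(8, 16))          # 8 tokens, 16-dim states
hidden = causal_self_attention(x_in)

first_tok = hidden[0]                    # attended only to itself
last_tok = hidden[-1]                    # attended to the whole sequence
mean_pooled = hidden.mean(axis=0)        # common pooling choice for causal LMs
```

If you do stick with MusicGen's LM for embeddings, mean-pooling (or last-token pooling) of the hidden states is the usual workaround for this asymmetry, rather than reading early positions.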