Open DuBose-Tuller opened 1 month ago
Has anyone used MusicGen to try to generate embeddings for audio/music datasets? Specifically the language model part, not just EnCodec. I have been trying to do this myself for a research project, and I am struggling to achieve any meaningful separation, even between dramatically different datasets.

Generally, causal (left-to-right, autoregressive) models don't make great embeddings, because the early tokens are missing most of the context due to the causal attention structure. Masked language models are better suited for embeddings. That's why many projects (including audiocraft) use T5 for text embeddings even though larger, newer (but autoregressive) models are available.

Perhaps MAGNeT would be a better fit for what you're trying to achieve, since it's non-autoregressive.
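To make the causal-mask point concrete: under a causal mask, token i only attends to tokens 0..i, so the first token's hidden state encodes essentially nothing about the sequence, while the last token (or a mean-pool over all positions) sees everything. A minimal numpy sketch of this — a toy single-head attention layer, not MusicGen's actual architecture, with `causal_self_attention` being a made-up illustrative function:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x):
    """Toy single-head self-attention with a causal (lower-triangular) mask.

    Token i can only attend to tokens 0..i, which is why first-token
    hidden states make poor whole-sequence embeddings in causal LMs.
    For illustration, x is used directly as queries, keys, and values.
    """
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # future positions
    scores[mask] = -np.inf                            # block attention to the future
    return softmax(scores, axis=-1) @ x

rng = np.random.default_rng(0)
x_in = rng.normal(size=(8, 16))          # 8 tokens, 16-dim states
hidden = causal_self_attention(x_in)

first_tok = hidden[0]                    # attended only to itself
last_tok = hidden[-1]                    # attended to the whole sequence
mean_pooled = hidden.mean(axis=0)        # common pooling choice for causal LMs
```

If you do stick with MusicGen's LM for embeddings, mean-pooling (or last-token pooling) of the hidden states is the usual workaround for this asymmetry, rather than reading early positions.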