facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License

Audio codec used for training in the original paper - very low bandwidth/quality? #282

Open vican9000 opened 10 months ago

vican9000 commented 10 months ago

First of all, great project!

One question though: in the original paper, you mention using a four-quantizer EnCodec for MusicGen training, with a fairly coarse stride (50 Hz frame rate). This produces pretty low-quality output (and mono, 32 kHz only). Have you done any ablation studies with larger bandwidths? For instance, in the EnCodec paper you trained a stereo 48 kHz 24 kbit/s model. What were the issues with using that in MusicGen?
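For context on why the bandwidth is so low, here is a quick back-of-the-envelope calculation, assuming the setup described in the papers (4 codebooks of 1024 entries each, 50 frames per second). The numbers are illustrative, derived from those assumptions rather than from the codebase:

```python
import math

# Assumed tokenizer configuration from the MusicGen paper:
# 4 RVQ codebooks, 1024 entries each (10 bits), 50 Hz frame rate.
n_codebooks = 4
codebook_size = 1024
frame_rate_hz = 50

bits_per_frame = n_codebooks * math.log2(codebook_size)
bitrate_kbps = bits_per_frame * frame_rate_hz / 1000
print(bitrate_kbps)  # 2.0 kbps for the MusicGen tokenizer

# Token budget the LM has to model: 200 tokens/s, so a 30 s clip
# is 6000 tokens. A 24 kbit/s stereo codec at the same codebook
# size would multiply this by ~12x.
tokens_per_sec = n_codebooks * frame_rate_hz
print(tokens_per_sec)  # 200
print(tokens_per_sec * 30)  # 6000 tokens for a 30 s clip
```

So the low bandwidth looks like a deliberate trade-off: every extra codebook or higher frame rate directly inflates the sequence length the language model must handle.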

@adefossez hopefully you can shed some light here. Thanks!

jbmaxwell commented 10 months ago

Related to this question, I'm curious what the requirements are for training a stereo MusicGen model. A while ago I tested the 48 kHz stereo EnCodec, but it seems it isn't supported by MusicGen because of the normalization values that accompany the latent codes. Can anyone give advice/guidance on stereo MusicGen training? Is it actually possible, or do the stereo EnCodec models always produce those normalization values?
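To illustrate the obstacle: the 48 kHz stereo EnCodec normalizes each audio segment before quantization and returns a per-segment scale alongside the codes, which the decoder needs to undo the normalization. The sketch below is a hypothetical simplification of that scheme (the function name and exact formula are illustrative, not the real EnCodec API), just to show why the token stream alone is not enough to reconstruct the audio:

```python
import numpy as np

# Hypothetical sketch of per-segment loudness normalization as used
# by the 48 kHz stereo EnCodec: each segment is scaled to roughly
# unit RMS, and the scale must be stored next to the codes so the
# decoder can restore the original volume. Illustrative only.
def normalize_segment(wav: np.ndarray, eps: float = 1e-8):
    # wav: [channels, samples]
    scale = float(np.sqrt(np.mean(wav ** 2))) + eps
    return wav / scale, scale

rng = np.random.default_rng(0)
segment = 0.3 * rng.standard_normal((2, 48000))  # 1 s of stereo noise
normed, scale = normalize_segment(segment)
rms_after = float(np.sqrt(np.mean(normed ** 2)))
print(round(rms_after, 3))  # ~1.0 after normalization
```

The consequence is that a MusicGen-style LM over this codec would have to model `(codes, scale)` jointly, whereas the current token stream only covers the discrete codes; that mismatch seems to be why the 48 kHz model isn't a drop-in replacement.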