facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.

Is it possible to generate higher quality audio at 44100 or 48000 Hz for example? #404

Open · ElizavetaSedova opened this issue 10 months ago

ElizavetaSedova commented 10 months ago

I noticed that sound quality is good for music generation, but is it possible to improve it?

DEBIHOOD commented 10 months ago

At the moment, at least not with the pretrained models that have been released to date.

Personally, I see two possible solutions for that:

Solution 1. Use some kind of upsampling model that upscales the output from 32 kHz (the rate of the EnCodec model MusicGen relies on) to 44100, 48000 or 96000 Hz. It's essentially the same idea as super-resolution for images, which has been researched for a couple of years, but for audio it's fairly new. I think I've seen a model/paper released not too long ago that addressed exactly this, but I don't remember the name. This option will probably make only a small or even tiny difference in audio quality; I'll explain why I think so further down.
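For reference, the naive (non-learned) baseline for this solution is plain resampling: it gives you a 44.1 kHz file, but it cannot invent the frequency content above 16 kHz that the 32 kHz signal never had, which is exactly why a learned super-resolution model would be needed for a real improvement. A minimal sketch with torchaudio, assuming a generated clip saved as `musicgen_out.wav` (the filename is just an example):

```python
import torchaudio

# Load a 32 kHz clip generated by MusicGen (example filename, adjust as needed).
wav, sr = torchaudio.load("musicgen_out.wav")
assert sr == 32000, f"expected a 32 kHz MusicGen clip, got {sr} Hz"

# Naive upsampling to 44.1 kHz: the sample rate changes, but no energy is added
# above the original 16 kHz Nyquist limit, so it will not sound "higher quality".
wav_44k = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=44100)
torchaudio.save("musicgen_out_44k.wav", wav_44k, 44100)
```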

Solution 2. Use the released EnCodec model that was trained at 48 kHz. This would require finetuning the existing MusicGen transformer LM to operate in the latent space of that EnCodec model, because tokens from EnCodec 32 kHz (used in all released MusicGen models) and EnCodec 48 kHz have different meanings and land in different places.

To make finetuning of the LM feasible, this 48 kHz EnCodec would ideally have the same token rate of 50 token steps per second (50 Hz), the same codebook dictionary size of 1024 (I think that's what EnCodec 32 kHz uses), and the same number of codebooks; EnCodec 32 kHz has 4 codebooks, so it translates 1 second of audio into 4 parallel streams of 50 tokens. After taking a quick look at the EnCodec paper, I think they used 16 codebooks for the 48 kHz model; I couldn't find what 1 second of audio translates into in terms of token steps, but let's hope it's the same 50 per second. So there's quite a mismatch: when starting the finetuning of the MusicGen transformer LM, it would expect EnCodec to have 4 codebooks, and specifying 16 will probably lead to a dimensionality-mismatch error, which, if it occurs, could perhaps be fixed with some model surgery. It might very well just fail (errors before the training script even starts, potential divergence, and nobody excludes the possibility that it trains just fine but the results are the same as, or even worse than, before in terms of quality), and you might be left with the only option of training your MusicGen transformer LM from scratch, which you are probably not even considering, so I'll skip that part.

You could also train your own EnCodec model on 44100/48000 Hz audio so that all the other dimensionalities match what the LM expects (50 token steps per second, 4 codebooks with a dictionary of 1024), which would make finetuning the LM to work in the new latent space easier. We could also consider taking the existing EnCodec 32 kHz and finetuning it into a 44.1/48 kHz model (if that's even possible with model surgery?), which I think would make finetuning the LM even easier, because the "token meanings" would probably not run too far from their starting point. Although I have a feeling that training EnCodec is a rather brittle process, so it might just fail after such a drastic change from 32 kHz to whatever rate is chosen. I think they chose 32 kHz for MusicGen's EnCodec because it has a stride of 640, so it conveniently translates into 32000/640 = 50 Hz... nice pretty numbers!
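To make the mismatch concrete, here is the back-of-the-envelope token arithmetic from the paragraph above (only the 32 kHz numbers are documented; the 50 Hz frame rate for the 48 kHz model is an assumption, as noted):

```python
# Token budget of the 32 kHz EnCodec used by MusicGen (documented numbers).
sample_rate = 32_000
stride = 640                                  # encoder hop size
frame_rate = sample_rate // stride            # 50 token steps per second
codebooks_32k = 4
cardinality = 1024                            # entries per codebook
tokens_per_second_32k = frame_rate * codebooks_32k   # 4 * 50 = 200 tokens/s

# Hypothetical 48 kHz EnCodec with 16 codebooks (frame rate assumed equal).
codebooks_48k = 16
tokens_per_second_48k = frame_rate * codebooks_48k   # 16 * 50 = 800 tokens/s

print(frame_rate, tokens_per_second_32k, tokens_per_second_48k)  # 50 200 800
# The LM's input/output heads are sized for 4 codebook streams, so feeding it
# 16 streams is exactly the dimensionality mismatch described above.
```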

Now, why do I think that 44.1/48 kHz is not really the goal you should pursue? You'll probably not notice much difference in quality. Humans hear in the range of roughly 20 Hz to 20 kHz (anything above that, up to about 22 kHz, is audible only to infants). The Nyquist–Shannon sampling theorem says that to fully represent a given bandwidth you need a sampling rate of twice that, hence 44100 Hz for consumer audio. EnCodec at 32 kHz can still theoretically represent human-audible frequencies up to 16 kHz, which is not that far from 20 kHz. In fact, as people age they also lose some hearing; the upper limit usually drops to 15–17 kHz (15 kHz is even lower than EnCodec's ceiling!). I did a small test: I converted a song that I know very well from the original 44.1 kHz down to 32 kHz, and I was hardly able to tell the difference. I've also found the claim that "a person in their twenties will be able to hear up to 17,000Hz or more, by their thirties this will have declined to about 16,000Hz. By the time an individual is in their 50s, their hearing range will usually have declined to around 12,000Hz." Based on this, I guess I can hear somewhere in the 17–19.5 kHz range, and to my knowledge I don't have any hearing issues. Test it with a song that you like and see whether you can tell the difference between 32 and 44.1 kHz.
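If you want to run the same listening test yourself, here is a small sketch with torchaudio that round-trips a 44.1 kHz file through 32 kHz so you can A/B the two at the same playback rate (the filenames are placeholders):

```python
import torchaudio

# Load a song you know well (placeholder filename).
wav, sr = torchaudio.load("my_song_44k.wav")   # e.g. sr == 44100

# Downsample to 32 kHz, then back up, so both files play at the same rate
# and the only difference is the content above 16 kHz that was thrown away.
wav_32k = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=32000)
wav_roundtrip = torchaudio.functional.resample(wav_32k, orig_freq=32000, new_freq=sr)

torchaudio.save("my_song_roundtrip_32k.wav", wav_roundtrip, sr)
# Now blind-compare my_song_44k.wav against my_song_roundtrip_32k.wav.
```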

If 32 kHz is not the issue, then what is? We need a better autoencoder. Compressing a song with EnCodec and decoding it back introduces a lot of audible artifacts, and it would still do so at 44.1/48 kHz. Multi-Band Diffusion improved quality quite a bit by replacing the decoder with a diffusion model, which, as a side effect, makes decoding slower. There was also a paper called DAC that improved on EnCodec. Although it is a better codec, several months later the MusicGen authors released a second version of their paper in which they said they had tried training the LM on tokens from the released pretrained DAC model and got worse results than with EnCodec. That either stems from the dataset DAC was trained on, or from the fact that it has 9 codebooks and "performs quantization in a lower dimension space", which is probably harder for the transformer to work with (it would be interesting to know why it happened, given that DAC is the better codec). The section about it was quite small and light on detail, and they didn't release those LM models, so it's hard to say why it turned out worse, or how much worse it is in terms of audible quality (there are also no generated samples from the model trained on DAC tokens).
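For what it's worth, Multi-Band Diffusion is already usable from this repo as a drop-in replacement for the EnCodec decoder. A minimal sketch, assuming the API shown in the MusicGen/MBD docs (`MultiBandDiffusion.get_mbd_musicgen()` and `tokens_to_wav()`); it trades decoding speed for fewer codec artifacts, still at 32 kHz:

```python
from audiocraft.models import MusicGen, MultiBandDiffusion
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
mbd = MultiBandDiffusion.get_mbd_musicgen()

model.set_generation_params(duration=8)
# Ask MusicGen for the discrete tokens as well as the EnCodec-decoded waveform.
wav, tokens = model.generate(["lo-fi hip hop with warm piano"], return_tokens=True)

# Decode the same tokens with the diffusion decoder instead of EnCodec's decoder.
wav_mbd = mbd.tokens_to_wav(tokens)

audio_write("encodec_decoded", wav[0].cpu(), model.sample_rate, strategy="loudness")
audio_write("mbd_decoded", wav_mbd[0].cpu(), model.sample_rate, strategy="loudness")
```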


None of this directly answers "how" to achieve what you want, but it does raise quite a few interesting points about the final audio quality you actually hear. Let's see what future papers on audio autoencoders and music generation bring us. "What a time to be alive!"