facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
MIT License

Hyperparameters for RVQ, causal, and streamable/non-streamable setup in EnCodec training #264

Open adagio715 opened 11 months ago

adagio715 commented 11 months ago

Hello everyone, I want to train a stereo 48 kHz EnCodec model on my own dataset. I have a few questions about the hyperparameter settings for RVQ and the causal and streamable/non-streamable setup:

  1. I read in the paper "High Fidelity Neural Audio Compression" that "for all of our models, we use at most 32 codebooks (16 for the 48khz model) with 1024 entries each". I was wondering why only n_q=16 for the 48 kHz model? I want to set n_q=32 for the 48 kHz model training; will this cause any problems?
  2. As a continuation of the first question, I noticed that the EnCodec models used for AudioGen and MusicGen both default to n_q=4, which is quite small. Why not make it larger? I would expect a larger n_q to produce better audio quality for AudioGen and MusicGen.
  3. From the papers, I gather that the highest EnCodec quality comes from the non-streamable setup. However, I don't know which parameters to set for non-streamable training. I am considering seanet.norm and seanet.pad_mode in config/model/encodec/default.yaml, but I'm not sure if I'm on the right track. Can anyone give some hints?
  4. What does rvq.r_dropout stand for? What does it change when this parameter is set to true versus false?
  5. In config/model/encodec/default.yaml, encodec.causal and encodec.renormalize are both set to False. Is it recommended to set them to True if we want to achieve higher audio quality?
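Regarding questions 1 and 2, the back-of-the-envelope arithmetic from the EnCodec paper may help: each RVQ codebook contributes log2(entries) bits per latent frame, so n_q directly sets the token bitrate. A minimal sketch in plain Python (not audiocraft code; the frame rates below assume the paper's hop length of 320 samples, i.e. 150 Hz at 48 kHz and 50 Hz at 32 kHz):

```python
import math

def rvq_bitrate_kbps(n_q: int, codebook_size: int, frame_rate_hz: float) -> float:
    """Bitrate of an RVQ token stream: n_q codebooks, each emitting
    log2(codebook_size) bits per latent frame."""
    bits_per_frame = n_q * math.log2(codebook_size)
    return bits_per_frame * frame_rate_hz / 1000.0

# 48 kHz EnCodec from the paper: 150 Hz frame rate,
# 16 codebooks of 1024 entries -> 24 kbps
print(rvq_bitrate_kbps(16, 1024, 150))   # 24.0

# Doubling to n_q=32 at 48 kHz doubles the token bitrate to 48 kbps
print(rvq_bitrate_kbps(32, 1024, 150))   # 48.0

# MusicGen's 32 kHz EnCodec: 50 Hz, 4 codebooks of 2048 entries -> 2.2 kbps
print(rvq_bitrate_kbps(4, 2048, 50))     # 2.2
```

So a larger n_q should indeed improve reconstruction quality, but for the generation models every extra codebook is another token stream the LM has to model per frame, which is presumably why MusicGen/AudioGen keep n_q small.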

Thank you very much for your help!

alexandre-xn commented 7 months ago

@adagio715 did you ever find out the answer for 1 and 2?

adagio715 commented 7 months ago

> @adagio715 did you ever find out the answer for 1 and 2?

Not really... I guess it was a trade-off between quality, efficiency, model size, etc. Did you have any insight on this? @AlexandreDRFT

alexandre-xn commented 6 months ago

> > @adagio715 did you ever find out the answer for 1 and 2?
>
> Not really... I guess it was a trade-off between quality, efficiency, model size, etc. Did you have any insight on this? @AlexandreDRFT

@adagio715 No, actually I'm still struggling to launch an experiment properly because of those params. I have a dataset correctly set up with a bunch of data at a 22050 Hz sample rate and the public 24 kHz EnCodec, but I can't figure out the correct codebook and n_q parameters to make it work, and the docs and paper are not clear on this. I'd be interested if you have any working setup for these!
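For the public 24 kHz EnCodec specifically, the paper's values are a 75 Hz latent frame rate (hop 320) and 1024-entry codebooks, so each codebook costs 750 bit/s and the advertised bandwidths (1.5 to 24 kbps) map directly to n_q; 22050 Hz audio would also need resampling to 24 kHz before encoding. A rough sketch of that mapping in plain Python (assuming those paper values, not audiocraft code):

```python
import math

# Public 24 kHz EnCodec, per the paper: hop length 320 samples.
SAMPLE_RATE = 24_000
HOP_LENGTH = 320
FRAME_RATE = SAMPLE_RATE / HOP_LENGTH   # 75 Hz latent frames
BITS_PER_CODEBOOK = math.log2(1024)     # 10 bits per codebook per frame

def n_q_for_bandwidth(bandwidth_kbps: float) -> int:
    """Number of RVQ codebooks needed to reach a target bandwidth."""
    return int(bandwidth_kbps * 1000 / (FRAME_RATE * BITS_PER_CODEBOOK))

for bw in (1.5, 3.0, 6.0, 12.0, 24.0):
    print(f"{bw} kbps -> n_q = {n_q_for_bandwidth(bw)}")
# 1.5 -> 2, 3.0 -> 4, 6.0 -> 8, 12.0 -> 16, 24.0 -> 32
```

This is just the bitrate arithmetic; the actual config names in audiocraft (e.g. under config/model/encodec/) should be checked against the repo's training docs.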