Text-to-Audio / Make-An-Audio

PyTorch Implementation of Make-An-Audio (ICML'23) with a Text-to-Audio Generative Model
MIT License

Personalized Text-to-Audio Generation? #2

Closed: SoshyHayami closed 7 months ago

SoshyHayami commented 8 months ago

Hi, I've seen that you showed an example of personalizing an audio SFX on your demo samples page. Can you tell me how to implement this with the inference code you've provided here?

Also, do you think it's worth training on an entire music dataset for the music generation task, or is the model only suited to sound effects and perhaps some light music generation? What steps should we take if we want to train at a higher sample rate (say 32k or 48k, or perhaps even stereo)?

Thanks.

Darius-H commented 8 months ago
  1. The generation quality of personalized audio is not satisfactory compared to plain text-to-audio.
  2. If you want to use it for music generation, you can finetune the diffusion model on a music dataset to get better performance. I think the VAE can generalize to music and need not be trained again: use the VAE to reconstruct samples from your music dataset to check whether you need a new one. If the reconstruction quality is good, you do not need to train a new VAE (see the reconstruction-check sketch after this list).
  3. If you want to train at a higher sample rate, you can change the config here: https://github.com/Text-to-Audio/Make-An-Audio/blob/ccc63dc790614bba9509c891759ff30d3d83e0f2/preprocess/mel_spec.py#L196C1-L211C6. Change audio_sample_rate to 32000 or 48000, and I think audio_num_mel_bins should be changed to 160 to get higher quality (see the config sketch below).
  4. We haven't tried stereo audio generation. Maybe you can generate a melspec for each sound channel to get a final melspec of shape (sound_channels, audio_num_mel_bins, T_mel); see the stacking sketch below.
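
A minimal sketch of the reconstruction check from point 2, assuming a latent-diffusion-style VAE whose encode returns a Gaussian posterior; the loader helpers, checkpoint path, and mel extraction are illustrative stand-ins for the repo's actual code:

```python
import torch

# Illustrative helpers -- replace with the repo's actual model loading
# (e.g. instantiate_from_config plus torch.load of the checkpoint).
from my_utils import load_pretrained_vae, wav_to_mel  # hypothetical names

vae = load_pretrained_vae("path/to/vae_checkpoint.ckpt")  # illustrative path
vae.eval()

# Mel spectrogram of one clip from your music dataset, shaped like the
# VAE's training input, e.g. (1, 1, audio_num_mel_bins, T_mel).
mel = wav_to_mel("my_music_clip.wav")

with torch.no_grad():
    posterior = vae.encode(mel)              # Gaussian posterior over latents
    recon = vae.decode(posterior.sample())   # decode the sampled latent

# Rough check: a low mel-domain MSE (plus listening to the vocoded
# reconstruction) suggests the 16 kHz VAE generalizes to your music data.
mse = torch.mean((recon - mel) ** 2).item()
print(f"mel reconstruction MSE: {mse:.4f}")
```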
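
For point 3, the changed fields would look roughly like this; only audio_sample_rate and audio_num_mel_bins are the values named above, the other entries are common mel defaults, not copied from mel_spec.py:

```python
# Illustrative excerpt of the mel config in preprocess/mel_spec.py.
mel_config = {
    "audio_sample_rate": 32000,   # was 16000; or 48000
    "audio_num_mel_bins": 160,    # was 80; more bins for the wider bandwidth
    "fft_size": 2048,             # illustrative
    "hop_size": 512,              # illustrative
    "win_size": 2048,             # illustrative
    "fmin": 0,
    "fmax": 16000,                # should track the new Nyquist (sample_rate / 2)
}
```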
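
And a sketch of the per-channel idea in point 4; generate_mel is a hypothetical stand-in for the repo's mono text-to-mel sampling:

```python
import torch

def generate_stereo_mel(prompt, generate_mel):
    """Stack one mono melspec per channel into (sound_channels, n_mel, T_mel).

    generate_mel is a hypothetical callable returning a tensor of shape
    (audio_num_mel_bins, T_mel) for a text prompt.
    """
    left = generate_mel(prompt)   # (audio_num_mel_bins, T_mel)
    right = generate_mel(prompt)  # sampled independently per channel
    return torch.stack([left, right], dim=0)
```

Note that sampling each channel independently gives no cross-channel coherence; sharing the same initial latent noise or conditioning across the two channels would be one way to keep them correlated.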

SoshyHayami commented 7 months ago

Thank you very much. I still wish you'd include it even if it isn't satisfactory; I was desperately looking for a similar feature, and the only working one I found was Meta's AudioGen, which is notoriously hard to train.

Regarding the VAE: its sample rate is 16 kHz, so should I increase it and retrain? What do you suggest? And if I change num_mel_bins, wouldn't that conflict with BigVGAN? To be honest, I have a single V100 (32 GB), or at most two. AudioLDM made me hopeful about how far I can push this, and I hope I can train with this config since the project seems very interesting.

Darius-H commented 7 months ago

I have updated audio2audio.py. If you want to use a higher sample rate and update the melspec processing, you need to retrain the VAE, and you also need to retrain BigVGAN. That will be time-consuming. If you are doing research, it's better not to change the sample rate, since you need to compare with previous works.