archinetai / audio-diffusion-pytorch

Audio generation using diffusion models, in PyTorch.
MIT License

Questions about conditional generation #61

Open AI-Guru opened 1 year ago

AI-Guru commented 1 year ago

Hi!

I have worked with unconditional generation using this fine repo. It is a lot of fun! I will do latent diffusion next. I am already looking forward to it.

Text-conditional generation promises a lot of fun. I have a few questions.

This is so cool!

Best, Tristan

flavioschneider commented 1 year ago
  1. That's correct
  2. Yes
  3. You'd have to set use_text_conditioning=False and provide your own embedding with embedding=.... See here if you want to make your own plugin for the UNet
  4. More tokens mean that each sequence at each layer in the UNet has to cross-attend to a longer provided embedding. This would be a bit slower, depending on how many more tokens you have, but could carry more information for the UNet.
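For point 3, the custom embedding has to match whatever the UNet was configured to receive. Below is a minimal, library-agnostic sketch that assumes a [batch, tokens, features] layout with a fixed token count and feature size (both assumptions to check against your own model config). pad_embedding and batch_embeddings are hypothetical helpers, not part of this repo, and the returned mask is only useful if your model accepts one.

```python
def pad_embedding(tokens, max_length, features):
    """Pad with zero vectors or truncate a list of token vectors to max_length.

    tokens: list of token vectors, each a list of `features` floats.
    Returns (padded, mask), where mask[i] is True for real tokens.
    """
    for t in tokens:
        if len(t) != features:
            raise ValueError(f"expected {features} features, got {len(t)}")
    kept = tokens[:max_length]  # truncate if there are too many tokens
    pad = [[0.0] * features for _ in range(max_length - len(kept))]
    mask = [True] * len(kept) + [False] * len(pad)
    return kept + pad, mask


def batch_embeddings(batch, max_length, features):
    """Apply pad_embedding to every item, giving a [batch, max_length, features] nested list."""
    padded, masks = zip(*(pad_embedding(t, max_length, features) for t in batch))
    return list(padded), list(masks)
```

You would then convert the padded result to a tensor (for example with torch.tensor(padded)) before passing it as embedding=....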
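To make point 4 concrete, here is a generic single-head scaled dot-product cross-attention in plain Python. It is an illustration, not this repo's implementation, and the function names are made up for the example. Queries stand in for one UNet layer's sequence, and keys/values stand in for the embedding tokens. The score matrix has one entry per (query, token) pair, so the work grows linearly with the number of tokens while the output shape stays the same.

```python
import math


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: n vectors (e.g. features of one UNet layer's sequence)
    keys, values: m vectors each (e.g. the m embedding tokens)
    Returns n output vectors, independent of m.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # One score per embedding token: m dot products for every query.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        # Numerically stable softmax over the m tokens.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        )
    return outputs


def score_entries(n_queries, n_tokens):
    """Size of the attention score matrix, which grows linearly with n_tokens."""
    return n_queries * n_tokens
```

For example, with an illustrative 1,024 query positions, going from 64 to 128 embedding tokens doubles the score entries from 65,536 to 131,072.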

SuperiorDtj commented 1 year ago

The number of parameters in the text-conditional model is only 562M rather than the 857M reported in the Moûsai paper. Is there any extra config for the text-conditional model?
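One way to narrow down a gap like 562M vs. 857M is to count parameters per component and check whether frozen or separately built parts are included. The sketch below assumes PyTorch-style modules, where torch.nn.Module.parameters() yields tensors exposing .numel() and .requires_grad. The component names you pass to report are hypothetical and depend on how you built your model.

```python
def count_parameters(module, trainable_only=False):
    """Count the parameters of a PyTorch-style module.

    Works with any object whose .parameters() yields tensors exposing
    .numel() and .requires_grad (e.g. torch.nn.Module).
    """
    return sum(
        p.numel()
        for p in module.parameters()
        if p.requires_grad or not trainable_only
    )


def report(components, trainable_only=False):
    """Print per-component and total parameter counts in millions; return the total."""
    total = 0
    for name, module in components.items():
        n = count_parameters(module, trainable_only)
        total += n
        print(f"{name}: {n / 1e6:.1f}M")
    print(f"total: {total / 1e6:.1f}M")
    return total
```

For example, report({"diffusion model": model}) counts everything reachable from model. Comparing the result with report({"diffusion model": model}, trainable_only=True) shows how much of the total comes from frozen parameters.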