Closed fmiotello closed 4 months ago
Hi @fmiotello, I am not entirely sure what you are asking. Inference can be "conditioned" on input audio because of how the diffusion process works. Effectively, it takes any sort of input and "moves" it into the region of known sounds. So whether the input is white noise or a "conditional" input of some other audio (e.g. a drum sound), the model will always bring it into its known region.
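To make that "moving into the known region" idea concrete, here is a minimal sketch of SDEdit-style conditioning: the input audio is partially noised to an intermediate level, and the reverse process is run from there, so the sample stays close to the input while landing in the model's learned distribution. The `denoise_fn` placeholder, the sigma schedule, and the `start_frac` parameter are illustrative assumptions, not the actual implementation in this repo.

```python
import torch

# Placeholder standing in for a trained diffusion model: it should
# predict the noise present in x at noise level sigma. A real model
# would be a trained U-Net; here it returns zeros just so the
# sketch runs end to end.
def denoise_fn(x, sigma):
    return torch.zeros_like(x)

def conditioned_sample(x_init, num_steps=50, start_frac=0.5):
    """Partially noise `x_init` and denoise it back.

    `start_frac` (an assumed knob) controls how much of the input's
    character is preserved: 0.0 would ignore the input entirely
    (pure generation from noise), 1.0 would return it unchanged.
    """
    # Simple linear noise schedule from full noise (1.0) down to clean (0.0).
    sigmas = torch.linspace(1.0, 0.0, num_steps + 1)
    start = int(num_steps * (1 - start_frac))
    # Jump into the middle of the diffusion trajectory by adding noise.
    x = x_init + torch.randn_like(x_init) * sigmas[start]
    # Deterministic (DDIM-like) reverse steps from that intermediate level.
    for i in range(start, num_steps):
        pred_noise = denoise_fn(x, sigmas[i])
        x0 = x - sigmas[i] * pred_noise        # estimate of the clean audio
        x = x0 + sigmas[i + 1] * pred_noise    # re-noise to the next, lower level
    return x
```

With a trained `denoise_fn`, feeding in white noise gives unconditional generation, while feeding in a drum sound pulls the output toward the training distribution while keeping its overall shape.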
Are you asking whether certain sounds can be used as inputs during the training process, rather than just white noise the entire time? If so, I suppose this is possible, but it may reduce the capabilities of the model.
If you are asking whether it could be conditioned on something like text prompts (e.g. Stable Audio, MusicGen, etc.), it certainly could be. I have not built this capability into this codebase, but the tools to add it should exist in audio-diffusion-pytorch, the library this is built upon.
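For a rough idea of how such conditioning is typically wired in during training, here is a sketch of feature-wise modulation (FiLM): a custom feature vector (a text embedding, pitch, instrument class, etc.) is projected to per-channel scale and shift values and applied inside the denoising network, so the model learns p(audio | features). The class name, layer sizes, and wiring are hypothetical, not the actual audio-diffusion-pytorch API.

```python
import torch
import torch.nn as nn

class FiLMConditionedBlock(nn.Module):
    """Minimal sketch of conditioning a denoiser block on custom features.

    The conditioning vector is mapped to a per-channel scale and shift
    (FiLM), modulating the convolutional activations. All names and
    dimensions here are illustrative assumptions.
    """
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        # Project the conditioning features to one scale + one shift per channel.
        self.to_scale_shift = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x, cond):
        # x: (batch, channels, time) noisy audio latents
        # cond: (batch, cond_dim) custom conditioning features
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        h = self.conv(x)
        return h * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)

# During a training step, the features are fed alongside the noisy input,
# so every denoising step sees the conditioning signal.
block = FiLMConditionedBlock(channels=8, cond_dim=16)
noisy = torch.randn(4, 8, 1024)   # batch of noisy audio latents
features = torch.randn(4, 16)     # custom features (e.g. a text embedding)
out = block(noisy, features)
```

Cross-attention over token embeddings (as in Stable Audio or MusicGen) is the other common choice; FiLM is shown here only because it is the shortest self-contained example.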
Yes, I was referring to the possibility of conditioning the training on other types of data besides audio. I'll look into the audio-diffusion-pytorch library then. Thanks for your help!
No problem and good luck!
I've seen in the accompanying Jupyter notebook that, during inference, it is possible to generate samples by conditioning on unseen audio data. Is it also possible to condition the training process on custom features, as is common in other diffusion models?
Thank you!