archinetai / audio-diffusion-pytorch

Audio generation using diffusion models, in PyTorch.
MIT License

Future Work - Models #67

Open AI-Guru opened 1 year ago

AI-Guru commented 1 year ago

Hi!

I am very curious about the future-work section of the paper.

It lists a few suggestions; let me ask about two of them.

1. Use perceptual losses.

You have just merged a PR that allows for loss customization. Which perceptual loss did you have in mind when you wrote the suggestion?
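For context on the first point, a common perceptual loss in audio models is a multi-resolution STFT loss (spectral convergence plus log-magnitude L1, as popularized by Parallel WaveGAN-style vocoders). A minimal sketch in plain PyTorch; the helper names here are my own, not part of audio-diffusion-pytorch:

```python
import torch
import torch.nn.functional as F

def stft_mag(x, n_fft, hop):
    # Magnitude STFT; window created on the input's device
    window = torch.hann_window(n_fft, device=x.device)
    return torch.stft(x, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True).abs()

def multi_res_stft_loss(pred, target,
                        resolutions=((512, 128), (1024, 256), (2048, 512))):
    # Hypothetical helper: spectral-convergence + log-magnitude L1 terms,
    # averaged over several STFT resolutions.
    loss = pred.new_zeros(())
    for n_fft, hop in resolutions:
        p, t = stft_mag(pred, n_fft, hop), stft_mag(target, n_fft, hop)
        sc = (t - p).pow(2).sum().sqrt() / t.pow(2).sum().sqrt().clamp(min=1e-7)
        log_l1 = F.l1_loss(p.clamp(min=1e-7).log(), t.clamp(min=1e-7).log())
        loss = loss + sc + log_l1
    return loss / len(resolutions)

pred = torch.randn(2, 48000)    # (batch, samples)
target = torch.randn(2, 48000)
loss = multi_res_stft_loss(pred, target)
```

Something like this could presumably be plugged into the new loss-customization hook, but I'd be interested to hear whether you had this family of losses in mind or something embedding-based instead.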

2. Use mel spectrograms instead of magnitude spectrograms as input.

dmae1d-ATC64-v2 uses the magnitude spectrogram.

What would be a good mel feature extractor?

I sometimes ran into this one but I would like to know what you think about it:

```python
encoder=MelE1d(  # The encoder used, in this case a mel-spectrogram encoder
    in_channels=in_channels,
    channels=512,
    multipliers=[1, 1],
    factors=[2],
    num_blocks=[12],
    out_channels=32,
    mel_channels=80,
    mel_sample_rate=48000,
    mel_normalize_log=True,
    bottleneck=TanhBottleneck(),
),
```

I believe this configuration extracts a lot of features, which puts a strain on the GPU.

Curious what you have to say about 1 and 2.

Cheers, Tristan