archinetai / audio-diffusion-pytorch

Audio generation using diffusion models, in PyTorch.
MIT License

Support usage with non-audio data, e.g. spectrograms #39

Closed Kinyugo closed 1 year ago

Kinyugo commented 1 year ago

I am trying to use the package to work with spectrograms, but I have encountered a problem. Some of the operations in the package are only designed to work with 3-d tensors, which limits their usability.

Request

I would like to request a change to make these operations more generic, so that they can be used with spectrograms (or any other data that may not necessarily be 3-d tensors). This would enable more users to use the package for a wider range of applications, and improve the overall usability of the package.

Examples

To illustrate the issue and the desired change, I have provided some examples below.

Sequential mask generation

The sequential_mask operation generates a boolean mask for a tensor. The original version of the operation is shown below:

import torch
from torch import Tensor

def sequential_mask(like: Tensor, start: int) -> Tensor:
    # True up to `start`, False afterwards, hard-coded to dim=2 (3-d tensors only).
    length, device = like.shape[2], like.device
    mask = torch.ones_like(like, dtype=torch.bool)
    mask[:, :, start:] = torch.zeros((length - start,), device=device)
    return mask

To make this operation more generic, we could change the third dimension (dim=2) to the last dimension (dim=-1). This would allow the operation to work with any tensor, regardless of its shape. The revised version of the operation would look like this:

def sequential_mask(like: Tensor, start: int) -> Tensor:
    # Same mask, but indexed along the last dimension so it works for tensors of any rank.
    length, device = like.shape[-1], like.device
    mask = torch.ones_like(like, dtype=torch.bool)
    mask[..., start:] = torch.zeros((length - start,), device=device)
    return mask
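
As a quick illustration (a hypothetical check, not part of the package), the revised version would then behave identically on waveforms and spectrograms:

import torch

waveform = torch.randn(4, 2, 1024)        # [batch, channels, time]
spectrogram = torch.randn(4, 2, 80, 256)  # [batch, channels, freq, time]

# Both calls mask everything from index 128 onward along the last axis.
assert sequential_mask(waveform, start=128).shape == waveform.shape
assert sequential_mask(spectrogram, start=128).shape == spectrogram.shape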

I am happy to contribute a fix to address these issues.

flavioschneider commented 1 year ago

What parts do you want to use with 2D tensors? The model is a 1D UNet, hence it uses 1D convolutions; it's not directly applicable to 2D unless you change the entire architecture. You could stack the channels of spectrograms and use those in the UNet1d, though. A minimal sketch of that stacking idea is below.
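
For reference, a rough sketch of the channel-stacking idea (the shapes and variable names here are illustrative assumptions, not the package API):

import torch

# A hypothetical batch of spectrograms: [batch, channels, freq_bins, time_frames]
spectrogram = torch.randn(4, 2, 80, 256)
batch, channels, freq_bins, frames = spectrogram.shape

# Fold the frequency axis into the channel axis so the data looks 1-d to UNet1d:
# [batch, channels * freq_bins, time_frames]
stacked = spectrogram.reshape(batch, channels * freq_bins, frames)

# ... run the 1D model on `stacked`, then undo the fold on its output ...
# output = output.reshape(batch, channels, freq_bins, frames)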

Kinyugo commented 1 year ago

I am interested in using the diffusion part only, i.e. schedulers, samplers, inpainting, etc. I think those parts could be made generic without altering how the rest of the code works.

flavioschneider commented 1 year ago

Yes, that's a good idea. I have to update the diffusion structure a bit in the following days (i.e. make it more adaptable to different diffusion types and samplers), and I will consider changing that as well.

Kinyugo commented 1 year ago

That will be awesome 🔥 🔥

flavioschneider commented 1 year ago

To follow up on this, v-diffusion + sampler (the ones I found to work the best) are now generic to any dimension. I temporarily removed the other k-diffusion ones as I wasn't getting amazing results with them. Do you need those as well? As a bonus, the U-Net from a-unet is also generic to any dimension, just in case :)
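
As a rough illustration of what "generic to any dimension" means here (my own sketch of the v-objective in plain PyTorch, not the library's exact code; `net` is assumed to take a noisy tensor and per-sample timesteps), the noising and target only broadcast over the batch dimension, so 1-d waveforms and 2-d spectrograms flow through the same math:

import math
import torch
import torch.nn.functional as F

def v_diffusion_loss(net, x: torch.Tensor) -> torch.Tensor:
    # `x` may be [B, C, T], [B, C, F, T], or any other rank with batch first.
    batch = x.shape[0]
    # Per-sample timestep, broadcast over all remaining dimensions.
    t = torch.rand(batch, device=x.device).view(batch, *([1] * (x.ndim - 1)))
    alpha, sigma = torch.cos(t * math.pi / 2), torch.sin(t * math.pi / 2)
    noise = torch.randn_like(x)
    x_noisy = alpha * x + sigma * noise   # noised sample
    v_target = alpha * noise - sigma * x  # v-prediction target
    return F.mse_loss(net(x_noisy, t.flatten()), v_target)

Because nothing indexes a specific spatial axis, the same function works unchanged for audio and spectrograms.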

Kinyugo commented 1 year ago

Thanks. That's amazing. I will be trying v-diffusion in a follow-up; for now I went with the diffusers implementation of DDIM (see project here). However, DDIM needs more iterations compared to more recent techniques. For the U-Net, the project implements a custom U-Net tailored for spectrograms. You could check whether it's something you might consider adding to this project. Good job 👏🏿

flavioschneider commented 1 year ago

Will close this as spectrograms are supported. Feel free to reopen if you think there's something missing.