archinetai / audio-diffusion-pytorch

Audio generation using diffusion models, in PyTorch.
MIT License

How to convert to wav file to listen to result? #46

Closed dillfrescott closed 1 year ago

dillfrescott commented 1 year ago

I am a bit confused how to do this. Any help would be appreciated! :)

tralala87 commented 1 year ago

I would also like to know this. I can't seem to download the audio file after running the code.

After running this:

from audio_diffusion_pytorch import DiffusionModel, UNetV0, VDiffusion, VSampler
import torch

model = DiffusionModel(
    # ... same as unconditional model
    net_t=UNetV0, # The model type used for diffusion (U-Net V0 in this case)
    in_channels=2, # U-Net: number of input/output (audio) channels
    channels=[8, 32, 64, 128, 256, 512, 512, 1024, 1024], # U-Net: channels at each layer
    factors=[1, 4, 4, 4, 2, 2, 2, 2, 2], # U-Net: downsampling and upsampling factors at each layer
    items=[1, 2, 2, 2, 2, 2, 2, 4, 4], # U-Net: number of repeating items at each layer
    attentions=[0, 0, 0, 0, 0, 1, 1, 1, 1], # U-Net: attention enabled/disabled at each layer
    attention_heads=8, # U-Net: number of attention heads per attention item
    attention_features=64, # U-Net: number of attention features per attention item
    diffusion_t=VDiffusion, # The diffusion method used
    sampler_t=VSampler, # The diffusion sampler used
    use_text_conditioning=True, # U-Net: enables text conditioning (default T5-base)
    use_embedding_cfg=True, # U-Net: enables classifier-free guidance
    embedding_max_length=64, # U-Net: text embedding maximum length (default for T5-base)
    embedding_features=768, # U-Net: text embedding features (default for T5-base)
    cross_attentions=[0, 0, 0, 1, 1, 1, 1, 1, 1], # U-Net: cross-attention enabled/disabled at each layer
)

# Train model with audio waveforms
audio_wave = torch.randn(1, 2, 2**18) # [batch, in_channels, length]
loss = model(
    audio_wave,
    text=['The audio description'], # Text conditioning, one element per batch
    embedding_mask_proba=0.1 # Probability of masking text with learned embedding (Classifier-Free Guidance Mask)
)
loss.backward()

# Turn noise into new audio sample with diffusion
noise = torch.randn(1, 2, 2**18)
sample = model.sample(
    noise,
    text=['The audio description'],
    embedding_scale=5.0, # Higher for more text importance, suggested range: 1-15 (Classifier-Free Guidance Scale)
    num_steps=2 # Higher for better quality, suggested num_steps: 10-100
)

Where and how can I download the audio file?

dillfrescott commented 1 year ago

I don't think an audio file is actually being written anywhere; I believe the data is just stored in memory.
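For what it's worth, model.sample only returns a tensor; nothing is written to disk unless you save it yourself. A quick check, assuming the sample variable from the snippet above:

print(type(sample))  # <class 'torch.Tensor'> -- just data in memory
print(sample.shape)  # torch.Size([1, 2, 262144]) = [batch, channels, length]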

mangoleaf commented 1 year ago

I have played around with saving the tensors as wav files (sample below for others interested), however the audio files I get out are complete static.

I would really appreciate it if someone can offer a correct solution to this. I would be happy to submit code adding a utility for this as well.

# Turn noise into new audio sample with diffusion
noise = torch.randn(1, 2, 2**18)
sample = model.sample(
    noise,
    text=['Bird chirping'],
    embedding_scale=5.0, # Higher for more text importance, suggested range: 1-15 (Classifier-Free Guidance Scale)
    num_steps=15 # Higher for better quality, suggested num_steps: 10-100
)

import soundfile as sf

def save_wav(tensor, path):
    tensor = tensor.squeeze()             # [1, 2, N] -> [2, N]: drop the batch dimension
    tensor = tensor / tensor.abs().max()  # peak-normalize (abs() so negative peaks count too)
    nparray = tensor.numpy(force=True).astype('float32').T  # transpose to [frames, channels]
    sf.write(path, nparray, samplerate=44100, format='wav')
    print("Done saving file")

save_wav(sample, "test_generated_sound.wav")
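For anyone adapting the snippet: the transpose is needed because sf.write expects 2-D data shaped [frames, channels], and normalizing by the absolute peak keeps the samples inside the [-1, 1] float range that wav encoding expects.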
flavioschneider commented 1 year ago

This is a library for researchers to train audio diffusion models; no pre-trained models are provided here. That's why you are getting only static: the model is not trained.
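To get anything other than noise out of model.sample, the model has to be trained first. A minimal loop sketch, assuming a dataloader that yields (waveform batch of shape [batch, 2, 2**18], list of captions) pairs; the dataloader, optimizer, and learning rate here are illustrative, not part of this library:

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for audio_wave, texts in dataloader:  # texts: list of strings, one per batch element
    loss = model(audio_wave, text=texts, embedding_mask_proba=0.1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()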

mangoleaf commented 1 year ago

Good to know; I'll admit I thought I saw it pulling down a pre-trained model when installing everything. That said, you ignored my question, which is still relevant since others have already asked it, and I will be looking into training this later today.

I was asking whether I am correctly interpreting the tensor data and converting it to a wav file. This should be included in the repo as a utility, and I was offering to help by opening a pull request to add it to the utility package.

flavioschneider commented 1 year ago

The correct way to save a tensor to a .wav file is as follows:

import torchaudio
sample_rate = 48000
torchaudio.save('test_generated_sound.wav', sample[0], sample_rate)

Where sample[0] indicates that you want to save the first element of the batch.
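
If you want to save every element of a batch rather than just the first, a minimal variant of the same call (the loop and filenames are illustrative; .cpu() is only needed if sampling ran on a GPU):

for i in range(sample.shape[0]):
    torchaudio.save(f'generated_{i}.wav', sample[i].cpu(), sample_rate)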

No additional utility or library is required for that.