Stability-AI / stable-audio-tools

Generative models for conditional audio generation
MIT License

Figure out and fix marination and inpainting. #109

Open piwell opened 4 months ago

piwell commented 4 months ago

I was investigating inpainting, but it didn't behave as I expected, both in how the marination parameter worked and in how the inpainting mask worked. I made some assumptions, which I will try to outline in as much detail as possible; after these small changes it works as I would expect. I'll start with what I actually think needs a fix, the get_bmask function, and then cover what I changed to match my assumptions, build_mask. (If the latter assumption is wrong I can remove the second commit.)

marination and bmask

Given the comment below and how inpainting_callback is implemented, I made some assumptions about how this should work and tried to fix it.

# builds a softmask given the parameters
# returns array of values 0 to 1, size sample_size, where 0 means noise / fresh generation, 1 means keep the input audio, 
# and anything between is a mixture of old/new
# ideally 0.5 is half/half mixture but i haven't figured this out yet

My assumption is that, given 10 steps and a mask whose values are all 0.5, new_x in the callback would be new_x = input_noise for the first 5 steps and new_x = x for the last 5. In other words, the bmask over the 10 steps would look something like this:

tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) 

With the current implementation, however, it looks like this:

tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

Here we keep the generated output for the first steps but then copy the initial noise in the later steps, which is the reverse of what I expected.
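To make the assumption concrete, here is a minimal sketch of the per-step blend I have in mind. The helper name blend_step and the exact formula are my own assumptions for illustration, not the code in the repo, and plain lists stand in for tensors.

```python
def blend_step(input_noise, x, bmask):
    """Hypothetical per-sample blend.

    bmask == 1 copies the noised input audio, bmask == 0 keeps the
    sampler's current output x. The formula is an assumption.
    """
    return [n * b + g * (1 - b) for n, g, b in zip(input_noise, x, bmask)]


input_noise = [0.1, 0.2, 0.3, 0.4]  # stand-in for the noised input at this step
x = [9.0, 9.0, 9.0, 9.0]            # stand-in for the generated output so far

early = blend_step(input_noise, x, [1, 1, 1, 1])  # all ones: copy input_noise
late = blend_step(input_noise, x, [0, 0, 0, 0])   # all zeros: keep x
mixed = blend_step(input_noise, x, [1, 1, 0, 0])  # per-sample selection

print(early)
print(late)
print(mixed)
```

Under this reading, a bmask of ones in the early steps anchors the result to the input audio, and a bmask of zeros in the late steps lets the sampler finish on its own.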

One more test to make sure!

Below is another test to ensure the new implementation is correct. Given a mask like this:

mask = [i / 10 for i in range(10)]

The current implementation gives these bmasks for 10 steps:

tensor([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
tensor([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
tensor([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
tensor([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
tensor([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
tensor([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

Here we copy more and more from the initial data as the steps progress.

The new implementation gives these bmasks for 10 steps:

tensor([0, 1, 1, 1, 1, 1, 1, 1, 1, 1])
tensor([0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
tensor([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
tensor([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
tensor([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
tensor([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Here we copy more at the start, then let the generation process take over more and more.
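For reference, here is a small sketch of two threshold rules that reproduce the rows printed above. These are reconstructions inferred only from those printed outputs; the names bmask_current and bmask_proposed are hypothetical, the actual get_bmask code may differ in detail, and plain lists stand in for tensors.

```python
def bmask_current(step, steps, mask):
    # Assumed rule: 1 once the mask value is at or below (step + 1) / steps.
    strength = (step + 1) / steps
    return [1 if m <= strength else 0 for m in mask]


def bmask_proposed(step, steps, mask):
    # Assumed rule: 1 while the mask value is still above step / steps.
    strength = step / steps
    return [1 if m > strength else 0 for m in mask]


steps = 10
ramp = [i / 10 for i in range(10)]  # the ramp mask from the test above
half = [0.5] * 10                   # the constant 0.5 mask from the first example

for step in range(steps):
    print(step, bmask_current(step, steps, ramp), bmask_proposed(step, steps, ramp))

# Number of steps in which the constant mask copies the input.
print(sum(bmask_current(s, steps, half)[0] for s in range(steps)))
print(sum(bmask_proposed(s, steps, half)[0] for s in range(steps)))
```

The only differences between the two rules are the direction of the comparison and the offset of the threshold, which is enough to flip the schedule from "copy late" to "copy early".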

Inpainting mask

Given the parameter names maskstart and maskend, my assumption is that I can mark out a section to regenerate. For example, for a 10s clip I can set maskstart=10 and maskend=90 (as percentages of the clip length) to keep the first and last seconds and inpaint the other 8 in the middle. The current implementation does the opposite, making it impossible to replace the middle part.
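A minimal sketch of the mask I would expect from those parameters, assuming maskstart and maskend are percentages and using the convention from the comment above (1 keeps the input audio, 0 regenerates). The function build_region_mask is hypothetical and is not the repo's build_mask; it produces a hard mask only and ignores any softness or marination settings.

```python
def build_region_mask(num_samples, maskstart, maskend):
    """Hypothetical hard mask: 0 inside [maskstart, maskend) percent, 1 outside."""
    start = int(num_samples * maskstart / 100)
    end = int(num_samples * maskend / 100)
    return [0 if start <= i < end else 1 for i in range(num_samples)]


# One entry per second of a 10s clip, to keep the example readable.
mask = build_region_mask(10, 10, 90)
print(mask)
```

With maskstart=10 and maskend=90 the first and last entries stay at 1 (kept) and the eight in the middle are 0 (regenerated), which is the behaviour I assumed from the parameter names.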

Comments

I hope this helps us explore inpainting and marination some more, but if the assumptions I based these changes on are wrong, feel free to disregard or reject them. Thank you!