ethman / tagbox

Steer OpenAI's Jukebox with Music Taggers

Guidance on reproducing reported SDRi in the paper #1

Closed gzhu06 closed 2 years ago

gzhu06 commented 2 years ago

Hi,

First of all, great work! And thanks for sharing the code, much appreciated!

I'm trying to reproduce the vocal part of the MUSDB18 results reported in Table 1 of the paper, but I'm getting really bad SDRi results.

(1) For the data preprocessing part, I cut the original mixtures into 5-second segments where the vocals are active (in some cases the vocal part is only silence);

(2) For separation, I'm using the code snippet from the colab in the repo. In my implementation, my parameters are:

TAGGER_SR = 16000  # Hz
JUKEBOX_SAMPLE_RATE = 44100  # Hz

# tagger source
tagger_training_data = 'MagnaTagATune' #@param ["MTG-Jamendo", "MagnaTagATune"] {allow-input: false}
tag = 'Vocals'

# audio processing parameters
fft512 = True 
fft1024 = True 
fft2048 = True 

n_ffts = []
if fft512:
    n_ffts.append(512)
if fft1024:
    n_ffts.append(1024)
if fft2048:
    n_ffts.append(2048)

# network architecture selections
fcn = True #@param {type:"boolean"}
hcnn = True #@param {type:"boolean"} 
musicnn = True #@param {type:"boolean"}
crnn = False #@param {type:"boolean"}
sample = False #@param {type:"boolean"}
se = False #@param {type:"boolean"}
attention = False #@param {type:"boolean"}
short = False #@param {type:"boolean"}
short_res = False #@param {type:"boolean"}

# separation paras
use_mask = True
lr = 5.0  
steps = 30 

(3) For evaluating SDRi, I'm using the asteroid package instead of museval (with museval, the SDR can easily be changed just by multiplying the audio samples by some scalar, even though I'm not evaluating SI-SDR).

(4) Also, I'm using the saved _masked.wav files to compute SDR (actually, the _raw_masked.wav files get higher SDR).

So I'm wondering which step could be causing the bad results? Thank you so much!

gzhu06 commented 2 years ago

Another question: in the paper, the optimization is described as gradient ascent. Does this mean the tagger cannot correctly label the decoded audio after optimization? But in the code it's just gradient descent.

ethman commented 2 years ago

Hi Ge,

Thanks for reaching out! I'll try to answer the questions as best I can.

(1) For the data preprocessing part, I cut the original mixtures into 5-second segments where the vocals are active (in some cases the vocal part is only silence);

We did not filter out silent parts of the signals for eval; we run eval on the whole audio file, as is done in SiSEC/MDX. We chop the whole signal into sliding windows with a length of 10 seconds and a hop of 5 seconds, apply a Hamming window to each segment, and add them back up. We also pad the beginning and end of the signal with one hop of zeros, and then remove that padding before stitching the segments back together. This is the same process as doing an STFT, except with a much larger window size (and no FFT). We used this torch convenience function for getting the windows.
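For reference, here's a minimal sketch of that chunk-and-stitch procedure. It assumes torch.Tensor.unfold is the convenience function mentioned above, assumes the track is longer than one window, and uses a placeholder separate_fn for running TagBox on a single chunk (whether the Hamming window is applied before or after separation isn't spelled out here; this sketch windows at the overlap-add stage).

```python
import torch

SR = 44100
WIN = 10 * SR   # 10-second windows
HOP = 5 * SR    # 5-second hop (50% overlap)

def separate_full_track(mix: torch.Tensor, separate_fn) -> torch.Tensor:
    """mix: mono signal of shape (num_samples,); separate_fn maps a chunk to a chunk."""
    # Pad one hop of zeros at each end (removed again at the very end).
    padded = torch.nn.functional.pad(mix, (HOP, HOP))
    # Pad the tail so the sliding windows cover the whole signal.
    extra = (-(padded.shape[-1] - WIN)) % HOP
    padded = torch.nn.functional.pad(padded, (0, extra))

    chunks = padded.unfold(dimension=-1, size=WIN, step=HOP)  # (n_chunks, WIN)
    window = torch.hamming_window(WIN)

    out = torch.zeros_like(padded)
    for i, chunk in enumerate(chunks):
        est = separate_fn(chunk)                  # separate one 10 s chunk
        start = i * HOP
        out[start:start + WIN] += est * window    # windowed overlap-add

    # Drop the zero-padding so the output lines up with the input mixture.
    return out[HOP:HOP + mix.shape[-1]]
```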

(2) For separation, I'm using the code snippet from the colab in the repo. In my implementation, my parameters are: ...

Yeah, this is buried in the paper a little bit, but we found that the best perceptual results came from using masking with three FFT/window sizes, while we actually got the best SDR scores with just one window size. There are a lot of reasons to be distrustful of SDR, and to me this is a clear case where SDR fails.

Also, I think we only turned on one tagger at a time for the experiments in the paper. I was hoping that if we used multiple taggers simultaneously, we could get better results (our setup is very similar to adversarial attacks, so the thought process was that using multiple taggers could keep the system from optimizing to the biases of just one tagger), but this ended up not really mattering.

Finally, I believe we used lr = 10.0 (high!), and steps = 10 IIRC, but you would have to check the paper to confirm.

(3) For evaluating SDRi, I'm using the asteroid package instead of museval (with museval, the SDR can easily be changed just by multiplying the audio samples by some scalar, even though I'm not evaluating SI-SDR).

Asteroid has two different SDR implementations (a mir_eval wrapper and SI-SDR), both of which will give you different results than the museval implementation. All the implementations have the same name, but are in fact all different. We used the museval SDR implementation.
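For what it's worth, here's a hedged sketch of how the museval numbers are usually computed; the dummy arrays stand in for real ground-truth sources and TagBox estimates, using museval's (n_sources, n_samples, n_channels) shape convention.

```python
import museval
import numpy as np

rng = np.random.default_rng(0)
# Dummy data standing in for ground-truth sources and estimates,
# shaped (n_sources, n_samples, n_channels).
references = rng.standard_normal((2, 44100 * 10, 2))
estimates = references + 0.1 * rng.standard_normal((2, 44100 * 10, 2))

# museval returns framewise scores (1-second windows by default);
# track-level SDR is typically reported as the median over frames.
sdr, isr, sir, sar = museval.evaluate(references, estimates)
print("source 0 SDR (median over frames):", np.nanmedian(sdr[0]))
```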

(4) Also, I'm using the saved _masked.wav files to compute SDR (actually, the _raw_masked.wav files get higher SDR).

Do those sound better though? IIRC, those didn't sound like the separated source. This I think gets back to "don't trust SDR."

Another question: in the paper, the optimization is described as gradient ascent. Does this mean the tagger cannot correctly label the decoded audio after optimization? But in the code it's just gradient descent.

I'm not sure I understand the question here. We're optimizing the Jukebox VQVAE embedding such that it produces audio that most matches the user-defined tags according to the tagger. So after the optimization is done the tagger will think that the audio matches the desired tags, no matter what it actually sounds like.

As for whether this is gradient ascent or descent: technically, we do not negate the direction of the gradient before we take an optimization step, which makes it gradient descent; however, it seems like there's some sloppiness in terminology out in the world. We got this idea from VQGAN+CLIP (& descendants), who called their system gradient ascent because the input is optimized, not the model (i.e., the opposite of the typical setup).
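To make the setup concrete, here's a rough sketch of the loop being described. The names decode, tagger, and target_tags are placeholders rather than the repo's actual API, and the BCE loss is an assumption about the objective; the key point is that only the embedding is updated and the loss is minimized directly.

```python
import torch

def steer_embedding(z, decode, tagger, target_tags, lr=10.0, steps=10):
    """Optimize a Jukebox VQ-VAE embedding so its decoded audio matches target_tags."""
    z = z.clone().requires_grad_(True)      # only the embedding is updated
    opt = torch.optim.Adam([z], lr=lr)      # decoder and tagger stay frozen
    loss_fn = torch.nn.BCELoss()
    for _ in range(steps):
        audio = decode(z)                   # Jukebox embedding -> audio
        pred = tagger(audio)                # tag probabilities for that audio
        loss = loss_fn(pred, target_tags)   # minimized directly => gradient descent
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```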

Anyway, I hope this helps! Let me know if you have any more questions.

-Ethan

gzhu06 commented 2 years ago

Thank you so much for the detailed replies, they are very helpful!

Regarding the training tricks, I'll try those suggestions later.

And I agree with the point that "don't trust SDR". In my experience, good SDR may reflect good separation results, but bad SDR doesn't necessarily mean the results are perceptually bad. (It's also true that the *_masked.wav files sound like separated sources.)

About my last question: I'm not sure I clearly understand why "mask M hat multiplied by the mixture spectrogram is to get an estimate of the audio data that should be removed from the mix".

Say we want to separate the "guitar" track from the mixture, so we set Ttarget = (one-hot encoding for 'guitar'). Then the predicted tag probabilities (over the decoded embeddings) should be consistent with Ttarget after optimization, right? So shouldn't the decoded embeddings be what we want to extract?

ethman commented 2 years ago

And I agree with the point that "don't trust SDR". In my experience, good SDR may reflect good separation results, but bad SDR doesn't necessarily mean the results are perceptually bad. (It's also true that the *_masked.wav files sound like separated sources.)

Whoops, actually this was my mistake in the previous reply. The *_masked.wav files are the intended source output files (I forgot how I named them). So I guess it's good that they sound like the sources!! 😊 😊

I'm not sure I clearly understand why "mask M hat multiplied by the mixture spectrogram is to get an estimate of the audio data that should be removed from the mix". Say we want to separate the "guitar" track from the mixture, so we set Ttarget = (one-hot encoding for 'guitar'). Then the predicted tag probabilities (over the decoded embeddings) should be consistent with Ttarget after optimization, right? So shouldn't the decoded embeddings be what we want to extract?

Yeah, so Jukebox is going to create an audio clip based on the location in the embedding space that you're in, and if we don't impose some kind of restriction on Jukebox it will generate audio that doesn't sound anything like the mix. Let's think about what happens when we let Jukebox generate audio freely, without any constraints. Theoretically, you're 100% right: if we optimized a "guitar" tag, it could synthesize audio that sounds like the guitar from our mix. The catch is, because Jukebox is unconstrained, this audio could sound like any guitar, not the one in the mix that we care about. But, again, it's unconstrained: it doesn't have to sound like a guitar at all! In practice, the optimization process ends up finding garbled nonsense that tricks the tagger into activating the right tags, even though the audio sounds nothing like a "guitar." We could maybe consider this an adversarial example for the tagger. You can try this for yourself in the colab notebook by unchecking the use_mask box. This should answer the second part of your question.

But, to fix this issue, we impose some restrictions on what Jukebox can generate. Specifically, we tell Jukebox that it can only take things away from the input mixture to get a source estimate (this is in fact what almost all source separation systems do). The way we impose this restriction on Jukebox is by taking its output audio and using it to make a mask on a magnitude spectrogram of the mixture (in the same way you might make an ideal ratio mask, or IRM). So this mask is the M hat.
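A minimal sketch of that masking idea, assuming an IRM-style construction; the STFT parameters and the clamp are illustrative, not necessarily what the repo does.

```python
import torch

def mask_from_jukebox(mix_audio, jukebox_audio, n_fft=1024, eps=1e-8):
    """Build M^hat from Jukebox's output and apply it to the mixture spectrogram."""
    window = torch.hann_window(n_fft)
    Y = torch.stft(mix_audio, n_fft, window=window, return_complex=True)
    S = torch.stft(jukebox_audio, n_fft, window=window, return_complex=True)

    # IRM-style mask from Jukebox's output magnitude; clamping to [0, 1] means
    # the mask can only attenuate the mixture, never add new energy.
    M_hat = torch.clamp(S.abs() / (Y.abs() + eps), 0.0, 1.0)

    # Masked mixture spectrogram (the mixture phase is kept).
    S_hat = M_hat * Y
    masked_audio = torch.istft(S_hat, n_fft, window=window,
                               length=mix_audio.shape[-1])
    return M_hat, masked_audio
```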

To be fair to you, two of our reviews from ICASSP mentioned that this section of the paper was unclear. So I don't blame you for being confused! We're rewriting it now so hopefully it will be better!

Best! -Ethan

gzhu06 commented 2 years ago

Hi Ethan,

Thanks for the clear explanation! The idea of applying masks to constrain the audio generated by Jukebox is great, and also quite important. Now I'm crystal clear about why you designed the generation process this way.

And one remaining minor question: why are the masks designed to remove content from the mixture?

According to Algorithm 1 in the paper: let's say that after many iterations Ts is very close to Ttarget (the guitar tag) in step 9, which means the Tagger gives iSTFT(S), i.e. S, a very high probability for guitar and low probabilities for the other instruments. So S is basically guitar, and then M ⊙ X should be guitar. Since each element of M ranges from 0 to 1, this operation is like a weighted version of X, i.e. M ⊙ X should be "some part" that is taken from X, right? So instead of taking from the mixture with the masks, why is it taking away?

ethman commented 2 years ago

I'm not sure if I understand your question. Masks always remove from mixtures. In general, masks always remove information, even when we're talking about masked language models (but they mask whole words, whereas we mask just some part of a TF bin in a spectrogram). This is the "weighted version of X" that you mention.

So instead of taking from the mixture with the masks, why is it taking away?

I don't understand the difference between these two options. Can you clarify further what you mean?

gzhu06 commented 2 years ago

According to this page:

A mask is a matrix that is the same size as a spectrogram and contains values in the inclusive interval [0.0,1.0]. Each value in the mask determines what proportion of energy of the original mixture that a source contributes. In other words, for a particular TF bin, a value of 1.0 will allow all of the sound from the mixture through and a value of 0.0 will allow none of the sound from the mixture through.

This is what I'm trying to say. More specifically, suppose we have a 5-second audio segment with 4 seconds of pure guitar at the beginning and 1 second of pure piano at the end, so the guitar and piano do not overlap at all, and we want to separate the guitar from this "mixture". The question is: what should the mask look like? Is it [1, 1, 1, 1, 0] or [0, 0, 0, 0, 1]? (For simplicity, we use only 5 time bins and 1 frequency bin, so the TF matrix is just 1 by 5.)

According to the paper, suppose we run TagBox on this segment for a long time, so that the objective function L(Ts, Ttarget) has converged. We can hypothetically trace back to see what the mask looks like:

It means that when we element-wise multiply the mask M with the mixture X, we get our target source, guitar. Therefore the mask should be [1, 1, 1, 1, 0]. And if this is correct, in the final step we can simply use iSTFT(S^hat) instead of x - iSTFT(S^hat).

gzhu06 commented 2 years ago

By the way, this page is awesome!

ethman commented 2 years ago

Yeah, that last step in TagBox is a little unintuitive. Why do we compute x - iSTFT(S^hat) instead of just using iSTFT(S^hat)? The answer comes down to what exactly Jukebox is producing and how we use its output.

I should have linked you to this page of the tutorial as well, where we actually use the masks in context. In the equations in that section, S is the ground-truth source spectrogram, M^hat is a mask (any mask for now), and Y is the mixture spectrogram. It's easy to see the relationship between the mask, mixture spec, and GT spec in the first equation. But in the second equation we solve for the mask, M^hat. So given some mixture Y and a source S, we can figure out what the mask will look like. Note that S doesn't have to be a ground-truth source spectrogram: we can use any other audio spectrogram to make a mask on a mixture, not just the ground truth (we probably wouldn't call it an "ideal mask" if we don't use the ground truth, though 🙂).
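Since the linked equations aren't quoted here, they are presumably of the following form (not copied verbatim from the tutorial):

```latex
% First equation: a mask applied element-wise to the mixture gives a source spectrogram.
S = \hat{M} \odot Y
% Second equation: solve for the mask given some source spectrogram S and mixture Y.
\hat{M} = \frac{|S|}{|Y|}
```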

Let's turn back to TagBox. That second equation of masking math in the previous paragraph is actually step 6 in the algorithm in the paper. Like I said, we can use any audio to create an "ideal mask" on a mixture. So we use the audio that the Jukebox embeddings produce. We convert that raw audio to a spectrogram and then use our mask equation to determine a mask (step 6 & the previous paragraph), apply that mask to the mix spectrogram (step 7 in the algorithm from the paper), then ask the Tagger what the masked audio sounds like, and update the Jukebox embedding accordingly.

So... back to our question: what is Jukebox producing exactly? It's producing audio that gets turned into a mask, which is used to remove things from the mix. Jukebox is determining what to remove from the mix. Therefore, the final result of our separation is x-iSTFT(S^hat), and not just iSTFT(S^hat).
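In code terms, a tiny hedged continuation of the earlier masking sketch, where masked_audio stands in for iSTFT(S^hat):

```python
def final_source_estimate(mix_audio, masked_audio):
    # Jukebox's masked output is what gets removed from the mix,
    # so the source estimate is the residual: x - iSTFT(S^hat).
    return mix_audio - masked_audio
```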


Some additional thoughts:

gzhu06 commented 2 years ago

Thank you so much for these detailed explanations.

use any audio to create an "ideal mask" on a mixture

This is quite important for understanding that confusing step. Also, eliciting pretrained priors is a very interesting direction. Thanks again!