bfs18 / rfwave

MIT License
Subband vs Duration #3

Open zaptrem opened 3 months ago

zaptrem commented 3 months ago

Hi, can you explain the difference between the subband and duration experiments and share which you've found to perform better? Also, what have you discovered about the use of classifier free guidance?

bfs18 commented 3 months ago

Thank you for your interest. The term 'duration' pertains to text-to-speech applications and is unrelated to reconstructing audio waveforms from Mel-spectrograms or EnCodec tokens.

Regarding the number of subbands, our preliminary experiments indicated that 16 subbands yield better results than 4 subbands, and 4 subbands outperform full-band processing. Using more subbands increases the amount of computation while keeping a similar parameter count and the same dataset. This could improve performance in line with the empirical scaling law, although that law has been derived primarily from experiments with Transformer models.

I've attached a figure to illustrate classifier-free guidance (CFG) and STFT loss. While CFG improves objective metrics in the vocoder experiments, it does not bring a corresponding increase in listening test scores. When reconstructing waveforms from EnCodec tokens, however, CFG substantially improves the listening experience. This may be because the abundant and deterministic information in the Mel-spectrogram renders CFG unnecessary.
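For readers unfamiliar with classifier-free guidance, the usual formulation combines a conditional and an unconditional prediction at each sampling step. The snippet below is a minimal NumPy sketch of that generic rule, not RFWave's actual code; `cfg_combine`, `v_cond`, `v_uncond`, and `scale` are illustrative names, and the random vectors only stand in for model outputs.

```python
import numpy as np


def cfg_combine(v_cond, v_uncond, scale):
    """Generic classifier-free guidance rule (assumed form, not taken from RFWave).

    v_cond:   prediction with the conditioning (e.g. Mel or EnCodec tokens)
    v_uncond: prediction with the conditioning dropped
    scale:    guidance scale
    """
    return v_uncond + scale * (v_cond - v_uncond)


# Stand-ins for two model outputs at one sampling step.
rng = np.random.default_rng(0)
v_cond = rng.standard_normal(4)
v_uncond = rng.standard_normal(4)

guided = cfg_combine(v_cond, v_uncond, scale=2.0)
```

With `scale=1.0` the rule returns the conditional prediction unchanged, and with `scale=0.0` the unconditional one. Values above 1 extrapolate past the conditional prediction, away from the unconditional one.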

zaptrem commented 3 months ago

Thanks! A few more questions:

  1. You mention 4 vs 16 subbands but I think the paper and code use 8 subbands. Is there a reason you're discussing 4 vs 16 here?
  2. I see you're using STFT loss in the above which is off by default in the code. Did you find phase loss and overlap loss to produce worse results? Is energy-balanced loss on by default?
  3. Have you tried applying your multi-band approach to Vocos (i.e., no diffusion)?
  4. Have you noticed that models with CFG turned on are significantly louder than those with it off?
  5. I noticed some buzzing when the audio is quiet/silent even with time balanced loss turned on. Does this go away as training continues?
  6. I'm testing out your models but have been getting unusual loss curves. Did you see these as well? (image)

bfs18 commented 3 months ago

Hi @zaptrem ,

  1. Apologies for the confusion earlier. In our initial experiments, we actually tested performance with 4 vs 8 subbands, not 16 subbands.
  2. The use of STFT loss has demonstrated an advantage in mitigating water-like noise when background noise is present in our experiments. We did not observe any significant improvements by incorporating phase loss. Moreover, the incorporation of overlap loss does not compromise performance; rather, it ensures coherence among the individually modeled subbands. We have determined that a weight coefficient of 0.01 is sufficient for both STFT loss and overlap loss in our setup. Energy balanced loss is discussed in Q5. The following is an example for STFT loss. There are some vertical patterns in the spectrogram of waveforms generated by a model without STFT loss. background.zip

Ground truth: (image: gt)

With STFT loss: (image: stft)

Without STFT loss: (image: wo stft)

  3. We have not yet applied the multi-band approach to the Vocos system. However, I believe it could be beneficial for Vocos as well, given that the multi-band approach generally increases the computation applied at a similar parameter count.
  4. I noticed that as well. Training input audio is normalized to a range between -1 and -6 dB. Therefore, normalizing the audio to this volume level during testing should yield more consistent results.
  5. Energy-balanced loss (time balanced loss in the code) is designed for this issue. The buzzing will disappear as training continues.
  6. This is my loss curve on the LJSpeech dataset (22.05 kHz). (image)

On the Opencpop dataset (44.1 kHz). (image)
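The loudness normalization mentioned in the reply above (inputs between -1 and -6 dB) could look roughly like the sketch below. It assumes the level refers to the peak amplitude in dBFS, which the thread does not state, and `normalize_peak` is a hypothetical helper rather than a function from the RFWave repository.

```python
import numpy as np


def normalize_peak(wave, target_db):
    """Scale `wave` so that its peak sits at `target_db` dBFS.

    Assumption: "-1 to -6 dB" in the thread means peak level; an RMS-based
    definition would need a different gain computation.
    """
    peak = np.max(np.abs(wave))
    if peak == 0.0:
        return wave  # silence: nothing to scale
    return wave * (10.0 ** (target_db / 20.0) / peak)


# A quiet 440 Hz test tone at 22.05 kHz, brought up to -3 dBFS peak.
t = np.linspace(0.0, 1.0, 22050, endpoint=False)
quiet = 0.05 * np.sin(2.0 * np.pi * 440.0 * t)
normalized = normalize_peak(quiet, target_db=-3.0)
```

Applying the same helper to test inputs, with a target inside the training range, keeps inference closer to the conditions the model saw during training.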

zaptrem commented 3 months ago

Thanks! It looks like you probably have a similar staircase effect between 0 and 30k but I'm not certain since it's zoomed out. Also, I noticed your PQMF filter is hard-coded to 8 bands, 124 taps, and cutoff 0.071. If using 16 bands would it be better to set these to (following the trend you set going from 4 to 8) 16 bands, 248 taps, and cutoff 0.0355? Also, similar to CFG scale have you noticed any other changes that disproportionately help with generating waveforms for Encodec tokens?

bfs18 commented 3 months ago

Hi @zaptrem , The PQMF is only used for waveform equalization, and 4 subbands are utilized. These subbands are equalized and then merged into an equalized waveform. The model splits the complex spectrogram into 8 subbands by selecting the appropriate dimensions and it has nothing to do with PQMF. I haven't experimented with splitting the complex spectrogram into 16 subbands, and I haven't observed any other factors that disproportionately enhance the generation of waveforms for Encodec tokens.
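To make the "selecting the appropriate dimensions" point concrete, here is a rough NumPy sketch of slicing a complex spectrogram into 8 subbands along the frequency axis, with 8 extra bins of context on each side (matching the `left_overlap` and `right_overlap` defaults discussed in this thread). The function names, the zero padding at the spectrum edges, and the 512-bin spectrogram are assumptions made for illustration; the actual RFWave code may handle edges and the Nyquist bin differently.

```python
import numpy as np


def split_subbands(spec, n_bands=8, left_overlap=8, right_overlap=8):
    """Slice a (n_bins, n_frames) complex spectrogram into overlapping subbands.

    Hypothetical helper: edges are zero-padded so every band has the same size.
    """
    n_bins = spec.shape[0]
    if n_bins % n_bands != 0:
        raise ValueError("n_bins must be divisible by n_bands")
    width = n_bins // n_bands
    padded = np.pad(spec, ((left_overlap, right_overlap), (0, 0)))
    size = left_overlap + width + right_overlap
    return [padded[b * width : b * width + size] for b in range(n_bands)]


def merge_subbands(bands, left_overlap=8, right_overlap=8):
    """Drop the overlap regions and stack the band cores back together."""
    cores = [band[left_overlap : band.shape[0] - right_overlap] for band in bands]
    return np.concatenate(cores, axis=0)


rng = np.random.default_rng(0)
spec = rng.standard_normal((512, 20)) + 1j * rng.standard_normal((512, 20))
bands = split_subbands(spec)
restored = merge_subbands(bands)
```

In this layout the last 16 bins of one band and the first 16 bins of the next cover the same frequencies, which is the region where adjacent bands need to agree.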

zaptrem commented 3 months ago

Thanks! For clarification, the original paper used 4 vs 8 (model, not PQMF) subbands, but you have since moved to 16 and that is how you got the results in this screenshot? Also, when you switched between 4/8/16 model subbands did you need to adjust the left_overlap and right_overlap parameters (which are both set to 8 by default) in RectifiedFlow?

bfs18 commented 3 months ago

Hi,

  1. The image link didn't work out. I can't see it. Could you resend it?
  2. The 8-dimensional overlap is fine for varying subbands. Adjustments to left_overlap and right_overlap aren't typically needed when changing subbands.

zaptrem commented 3 months ago

  1. Thanks!
  2. This one: image

Were these with 16 bands? If not, did 16 bands improve it beyond these results?

bfs18 commented 3 months ago

In this table, RFWave utilizes 8 subbands, a detail noted in the paper's text but omitted from the table's title.

zaptrem commented 2 months ago

image

Doesn't this table (and yours above) show overlap loss actually making the results worse (PESQ, V/UV, Periodicity) or no better (ViSQOL) compared to no overlap loss?

bfs18 commented 2 months ago

In initial trials, I noticed the presence of horizontal striations across subbands within the spectrograms occasionally. Implementing overlap loss as a countermeasure effectively mitigated this issue, leading to its adoption as the default setting. I did not expect an improvement in objective metrics with overlap loss. My primary goal was to ensure robust performance across a diverse range of configurations.

zaptrem commented 2 months ago

Thanks. Your model uses a lot of bands/compute on frequencies the human ear doesn't really care about, so I've been trying to fix that. Here's what 16 bands looks like with your current approach:

standard_spectrogram

I thought of using an stft with a higher n_fft (4096) for the lower frequencies (so more bands/compute is spent on parts we care about) and a lower n_fft on the higher frequencies (1024 but still satisfying COLA since hop size is 512).
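As a quick sanity check on the COLA claim, the sketch below measures how far the overlap-added window sum deviates from a constant for both resolutions at hop size 512. It assumes a periodic Hann window, which the post does not specify, and `cola_deviation` is a hypothetical helper. It only checks each resolution on its own and says nothing about the boundary between the two.

```python
import numpy as np


def cola_deviation(n_fft, hop):
    """Max deviation of the overlap-added window sum from its mean.

    Assumes a periodic Hann window and n_fft divisible by hop.
    A value near zero means the constant-overlap-add condition holds.
    """
    if n_fft % hop != 0:
        raise ValueError("this sketch requires n_fft to be a multiple of hop")
    n = np.arange(n_fft)
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / n_fft)
    # Column j sums every window sample that lands on position j modulo hop.
    overlap_sum = window.reshape(-1, hop).sum(axis=0)
    return float(np.max(np.abs(overlap_sum - overlap_sum.mean())))


low_band_dev = cola_deviation(4096, 512)   # high-resolution STFT for low frequencies
high_band_dev = cola_deviation(1024, 512)  # low-resolution STFT for high frequencies
no_overlap_dev = cola_deviation(1024, 1024)  # counter-example: no overlap at all
```

Both resolutions give a deviation at floating-point noise level, while the no-overlap case does not, so under these assumptions the window choice alone is not the source of the invertibility gap.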

stft_comparison_50_iterations

The problem I ran into is my multi-resolution STFT implementation is not as perfectly invertible as a normal spectrogram (though it's close), and I'm not quite sure why. When you run it back and forth hundreds of times some ringing/artifacts will appear on the border between the two resolutions. When you train on these (with overlap turned off since it doesn't make sense across resolution boundaries and wave=true, haven't tried false) the model seems to have a much harder time wrapping its head around anything and after one night still hasn't figured out phase.

Have you tried anything like this?

My code in case you're interested: https://gist.github.com/zaptrem/94d10c5d76d2f601841e9f8e8bf4859a

Also, I'm not quite sure I understand the motivation for doing this pred = self.stft(self.istft(pred)) all the time (e.g., in compute_stft_loss and when taking inference steps) when wave is true. Why do you do it?

And why did you stop using the trick Vocos uses for better phase outputs (since phase is periodic)?

Also, (sorry for so many questions haha) with wave=false did you find the feature_loss to improve things or make them worse?

bfs18 commented 2 months ago

Hi @zaptrem

  1. I've tested your code and noticed a horizontal line between the two subbands. Additionally, there seems to be an error in the code regarding the subband dimensions; they should be 1024 and 257 instead of 1024 and 256. However, even after correcting this, artifacts are still present, and I have not yet found a solution. One potential way to limit the error accumulation would be to set wave=False and do the modeling directly in the frequency domain. This would require only a single inverse STFT after sampling, which may reduce the severity of the artifacts. (image: 20240701-144528)
  2. stft(istft(pred)) follows the STFT and ISTFT operations in Figure 1 of the paper.

zaptrem commented 2 months ago

"Additionally, there seems to be an error in the code regarding the subband dimensions; they should be 1024 and 257 instead of 1024 and 256"

Thanks! Can you tell me more about where this is happening and how you fixed it? One of my suspicions for why this was performing significantly worse was the shape/location of the real/imaginary channels was shifting between bands, but I could never determine whether that was actually happening. Also noting here that I tried some CQT/wavelet based transforms and got similarly mediocre results, I think because they're much more periodic than real/imag STFT.

Also, I've noticed the waterfall effect seems to continue even with STFT loss enabled in high-noise regions of the spectrogram, and it becomes more prominent as you get further into the training run. Could it be caused by errors being amplified between bands via the stft/istft operation at each step?

zaptrem commented 1 month ago

image

Also, I noticed the waterfall effect is still quite prevalent when there's noise but not complete silence even at the end of training with STFT loss enabled.

bfs18 commented 1 month ago

Hi @zaptrem, I also don't know how to fix the mismatch when shifting between bands of different FFT size.

Regarding the waterfall noise, I believe it occurs because the model attempts to reconstruct a phase even for background noise, where no meaningful phase is present. Increasing the STFT loss weight might resolve this issue by making the model place slightly more emphasis on the magnitude information.
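The idea of weighting a magnitude-only term more heavily can be sketched as follows. This is not RFWave's `compute_stft_loss`; the log-magnitude L1 form, the function names, and the scalar `flow_loss` are assumptions for illustration. Only the 0.01 default weight comes from earlier in this thread.

```python
import numpy as np


def magnitude_loss(pred_spec, target_spec, eps=1e-5):
    """L1 distance between log-magnitudes (assumed form; ignores phase)."""
    pred_mag = np.log(np.abs(pred_spec) + eps)
    target_mag = np.log(np.abs(target_spec) + eps)
    return float(np.mean(np.abs(pred_mag - target_mag)))


def total_loss(flow_loss, pred_spec, target_spec, stft_weight=0.01):
    """Main training loss plus a weighted magnitude term (hypothetical layout)."""
    return flow_loss + stft_weight * magnitude_loss(pred_spec, target_spec)


rng = np.random.default_rng(0)
target = rng.standard_normal((64, 10)) + 1j * rng.standard_normal((64, 10))
pred = target + 0.1 * (rng.standard_normal((64, 10)) + 1j * rng.standard_normal((64, 10)))

# Scrambling the phase of the prediction leaves the magnitude term unchanged.
random_phase = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=pred.shape))
loss_original = magnitude_loss(pred, target)
loss_scrambled = magnitude_loss(pred * random_phase, target)
```

Because the term is blind to phase, raising `stft_weight` shifts the balance toward matching magnitudes, which is the effect described above for noise-like regions.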

zaptrem commented 1 month ago

I figured out you can use a PQMF to get clean cuts that you can apply different STFT settings to, but I haven't tried a model with it yet because idk how the aliasing cancellation will work when neither top-level-band is aware of the other.

Something I noticed (though haven't rigorously confirmed) is that the waterfall effect is actually less prominent earlier in training. Also notice how the lines span between bands. Could small errors be getting amplified by the overlap between bands or possibly some sort of spectral bleeding during sampling? Did you see the waterfall effect in your experiments from before you started using STFT(ISTFT()) each step during sampling?

Edit: The waterfalling goes away entirely when I disable the stft(istft()) at inference time. However, otherwise the quality becomes worse: image

Top: istft/stft turned off. Middle: Turned on. Bottom: Ground truth.

zaptrem commented 1 month ago

@accum-dai @Beronx86 Have you guys noticed this? It is possible solving this issue could help RFWave beat BigVGAN.

bfs18 commented 1 month ago

Hi @zaptrem, I apologize for not noticing that you updated your post. "stft(istft(x))" was added when the model gained support for conditioning higher bands on lower bands, and in that setting it is applied to a single subband's complex spectrum. The reason is that the inverse short-time Fourier transform (istft) of a crafted spectrum may not yield a real signal, and treating the inverse as real over many sampling steps causes errors to accumulate. When all subbands are predicted in parallel, "stft(istft(x))" actually operates on the full-band spectrum, which makes it potentially unnecessary. However, I have not yet tried removing it. I appreciate that you tried it, and I will run further tests to explore this.
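One way to read the "stft(istft(x))" step is as a projection onto spectrograms that actually correspond to a waveform. The NumPy sketch below illustrates that reading with a small hand-written STFT pair; the window, FFT size, hop, and function names are all assumptions, and this is not the RFWave implementation.

```python
import numpy as np

N_FFT, HOP = 256, 128
# Periodic Hann window (assumed; the thread does not name the window).
WINDOW = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(N_FFT) / N_FFT)


def stft(x):
    """Centered STFT of a real signal, returning shape (N_FFT // 2 + 1, n_frames)."""
    padded = np.pad(x, N_FFT // 2)
    n_frames = 1 + (len(padded) - N_FFT) // HOP
    frames = [padded[i * HOP : i * HOP + N_FFT] * WINDOW for i in range(n_frames)]
    return np.fft.rfft(np.stack(frames, axis=1), axis=0)


def istft(spec):
    """Least-squares inverse: windowed overlap-add divided by the summed window power."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=0) * WINDOW[:, None]
    n_frames = spec.shape[1]
    length = N_FFT + HOP * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i in range(n_frames):
        out[i * HOP : i * HOP + N_FFT] += frames[:, i]
        norm[i * HOP : i * HOP + N_FFT] += WINDOW ** 2
    pad = N_FFT // 2
    return out[pad:-pad] / norm[pad:-pad]


def project(spec):
    """stft(istft(spec)): nearest spectrogram that some real waveform produces."""
    return stft(istft(spec))


rng = np.random.default_rng(0)
crafted = rng.standard_normal((N_FFT // 2 + 1, 20)) + 1j * rng.standard_normal((N_FFT // 2 + 1, 20))
once = project(crafted)
twice = project(once)
```

An arbitrary complex array such as `crafted` is changed by the projection, while a second application leaves the result untouched. Skipping the step during sampling therefore means the intermediate spectra are never pulled back toward ones a real waveform could produce.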

bfs18 commented 3 weeks ago

Hi @zaptrem Is your dataset publicly available? I would like to test it.

zaptrem commented 3 weeks ago

It's a personal dataset so unfortunately can't share it. Closest approximation would be something like https://github.com/mdeff/fma plus https://github.com/MTG/mtg-jamendo-dataset.

bfs18 commented 3 weeks ago

@zaptrem Is it possible to share a few samples for testing, especially the one shown in the figure above?

zaptrem commented 3 weeks ago

I wish I could, but I'm not certain the license grants me permission to post it publicly. Here's a sample I made myself that should repro the issue

https://github.com/user-attachments/assets/d5d314c4-d7c0-4de2-9680-2e4a75978144

bfs18 commented 3 weeks ago

Using this checkpoint, I haven't been able to reproduce your issue. Could you test it on your internal test case and share the result? @zaptrem

gt (image)

syn (image)