zaptrem opened this issue 6 months ago
Hi, can you explain the difference between the subband and duration experiments and share which you've found to perform better? Also, what have you discovered about the use of classifier-free guidance?
Thank you for your interest. The term 'duration' pertains to text-to-speech applications and is unrelated to the reconstruction of audio waveforms from Mel-spectrograms or EnCodec tokens.

Regarding the number of subbands, our preliminary experiments indicated that using 16 subbands yields better results than 4 subbands, and 4 subbands outperform full-band processing. Employing more subbands increases the amount of computation while maintaining a similar parameter count and using the same dataset. This could enhance performance according to the empirical scaling law, although it's worth noting that the scaling law has primarily been summarized from experiments with Transformer models.

I've attached a figure to illustrate classifier-free guidance and STFT loss. While CFG enhances objective metrics in the vocoder experiments, it does not lead to a corresponding increase in listening test scores; however, when reconstructing waveforms from EnCodec tokens, CFG substantially improves the listening experience. In the vocoder case, this may be because the abundant and deterministic information in the Mel-spectrogram renders CFG unnecessary.
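For reference, CFG on a flow-matching velocity field is usually combined along the lines of the sketch below. This is a generic illustration, not RFWave's actual API; `model`, `cond`, and the all-zero stand-in for the unconditional branch are assumptions.

```python
import torch

def guided_velocity(model, x_t, t, cond, cfg_scale=2.0):
    # Velocity prediction with the conditioning (e.g., EnCodec-token features).
    v_cond = model(x_t, t, cond)
    # Unconditional branch; in practice this is whatever "null" condition the
    # model was trained with (assumed here to be an all-zero condition).
    v_uncond = model(x_t, t, torch.zeros_like(cond))
    # Extrapolate away from the unconditional prediction.
    return v_uncond + cfg_scale * (v_cond - v_uncond)
```

With `cfg_scale = 1.0` this reduces to the plain conditional velocity, which matches the observation above that CFG can be left off for Mel-spectrogram input.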
Thanks! A few more questions:
Hi @zaptrem,

[Figure: spectrograms comparing ground truth, with STFT loss, and without STFT loss, on the Opencpop dataset (44.1 kHz).]
Thanks! It looks like you probably have a similar staircase effect between 0 and 30k, but I'm not certain since it's zoomed out. Also, I noticed your PQMF filter is hard-coded to 8 bands, 124 taps, and a cutoff of 0.071. If using 16 bands, would it be better to set these (following the trend you set going from 4 to 8) to 16 bands, 248 taps, and a cutoff of 0.0355? Also, similar to the CFG scale, have you noticed any other changes that disproportionately help with generating waveforms for Encodec tokens?
Hi @zaptrem, the PQMF is only used for waveform equalization, and 4 subbands are utilized there: the subbands are equalized and then merged back into an equalized waveform. The model splits the complex spectrogram into 8 subbands by selecting the appropriate frequency dimensions, and this has nothing to do with the PQMF. I haven't experimented with splitting the complex spectrogram into 16 subbands, and I haven't observed any other factors that disproportionately enhance the generation of waveforms for Encodec tokens.
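To make the distinction concrete, here is a minimal sketch of what splitting a complex spectrogram into subbands along the frequency axis (with a small bin overlap between neighbors) could look like; the function name and shapes are illustrative assumptions, not RFWave's exact code.

```python
import torch

def split_subbands(spec, n_subbands=8, overlap=8):
    # spec: (batch, freq_bins, frames), complex-valued STFT.
    band = spec.shape[1] // n_subbands
    subbands = []
    for i in range(n_subbands):
        lo = max(0, i * band - overlap)                     # left-overlap bins
        hi = min(spec.shape[1], (i + 1) * band + overlap)   # right-overlap bins
        subbands.append(spec[:, lo:hi])
    return subbands
```

Note that no filter bank is involved: the split is just index selection on the frequency axis, which is why it is independent of the PQMF used for equalization.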
Thanks! For clarification, the original paper used 4 vs 8 (model, not PQMF) subbands, but you have since moved to 16, and that is how you got the results in this screenshot? Also, when you switched between 4/8/16 model subbands, did you need to adjust the `left_overlap` and `right_overlap` parameters (which are both set to 8 by default) in `RectifiedFlow`?
Hi,
- The image link didn't work; I can't see it. Could you resend it?
- The 8-dimensional overlap is fine for varying numbers of subbands; adjustments to `left_overlap` and `right_overlap` aren't typically needed when changing subbands.
- Thanks!
- This one: [image reattached]

Were these with 16 bands? If not, did 16 bands improve it beyond these results?
In this table, RFWave utilizes 8 subbands, a detail noted in the paper's text but omitted from the table's title.
Doesn't this table (and yours above) show overlap loss actually making the results worse (PESQ, V/UV, Periodicity) or no better (ViSQOL) compared to no overlap loss?
In initial trials, I occasionally noticed horizontal striations across subbands in the spectrograms. Implementing overlap loss as a countermeasure effectively mitigated this issue, leading to its adoption as the default setting. I did not expect an improvement in objective metrics from the overlap loss; my primary goal was to ensure robust performance across a diverse range of configurations.
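For readers following along, an overlap loss in this spirit might look like the following sketch: penalize disagreement between the shared frequency bins of adjacent subband predictions. The names and the choice of MSE are assumptions for illustration, not RFWave's implementation.

```python
import torch
import torch.nn.functional as F

def overlap_loss(subbands, overlap=8):
    # subbands: list of (batch, bins, frames) complex tensors whose neighbors
    # share `overlap` frequency bins at their boundary.
    loss = 0.0
    for a, b in zip(subbands[:-1], subbands[1:]):
        shared_a = torch.view_as_real(a[:, -overlap:])  # top bins of lower band
        shared_b = torch.view_as_real(b[:, :overlap])   # bottom bins of upper band
        loss = loss + F.mse_loss(shared_a, shared_b)
    return loss
```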
Thanks. Your model uses a lot of bands/compute on frequencies the human ear doesn't really care about, so I've been trying to fix that. Here's what 16 bands looks like with your current approach:
I thought of using an STFT with a higher n_fft (4096) for the lower frequencies (so more bands/compute are spent on the parts we care about) and a lower n_fft (1024) for the higher frequencies, still satisfying COLA since the hop size is 512.
The problem I ran into is that my multi-resolution STFT implementation is not as perfectly invertible as a normal spectrogram (though it's close), and I'm not quite sure why. When you run it back and forth hundreds of times, some ringing/artifacts appear on the border between the two resolutions. When you train on these (with overlap turned off, since it doesn't make sense across resolution boundaries, and `wave=true`; I haven't tried false), the model seems to have a much harder time wrapping its head around anything, and after one night it still hasn't figured out phase.
Have you tried anything like this?
My code in case you're interested: https://gist.github.com/zaptrem/94d10c5d76d2f601841e9f8e8bf4859a
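(For anyone skimming, the core of the idea is roughly the following sketch, not the gist itself: two STFTs sharing a hop of 512 so their frames align, with the low bins taken from the fine-frequency transform and the high bins from the coarse one. The split frequency and names are illustrative.)

```python
import torch

def multires_stft(x, hop=512, n_fft_low=4096, n_fft_high=1024,
                  split_hz=4000, sr=44100):
    # x: (batch, time). Both transforms use center=True, so with a shared hop
    # they produce the same number of frames.
    win_lo = torch.hann_window(n_fft_low, device=x.device)
    win_hi = torch.hann_window(n_fft_high, device=x.device)
    S_lo = torch.stft(x, n_fft_low, hop_length=hop, window=win_lo, return_complex=True)
    S_hi = torch.stft(x, n_fft_high, hop_length=hop, window=win_hi, return_complex=True)
    k_lo = int(split_hz / sr * n_fft_low)   # split bin in the 4096-point STFT
    k_hi = int(split_hz / sr * n_fft_high)  # split bin in the 1024-point STFT
    return S_lo[:, :k_lo], S_hi[:, k_hi:]
```

The imperfect invertibility mentioned above is plausibly inherent to a hard split like this: each half discards information that the other transform's window spreads across the boundary, so repeated round trips accumulate ringing there.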
Also, I'm not quite sure I understand the motivation for doing `pred = self.stft(self.istft(pred))` all the time (e.g., in `compute_stft_loss` and when taking inference steps) when `wave` is true. Why do you do it?
And why did you stop using the trick Vocos uses for better phase outputs (since phase is periodic)?
Also (sorry for so many questions haha), with `wave=false` did you find the `feature_loss` to improve things or make them worse?
Hi @zaptrem
- Additionally, there seems to be an error in the code regarding the subband dimensions; they should be 1024 and 257 instead of 1024 and 256
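(For context: a one-sided FFT of `n_fft` samples yields `n_fft // 2 + 1` frequency bins, which is presumably where the off-by-one comes from.)

```python
import torch

# rFFT of 512 samples -> 257 bins, not 256.
print(torch.fft.rfft(torch.randn(512)).shape[0])  # 257 == 512 // 2 + 1
```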
Thanks! Can you tell me more about where this is happening and how you fixed it? One of my suspicions for why this was performing significantly worse was that the shape/location of the real/imaginary channels was shifting between bands, but I could never determine whether that was actually happening. Also noting here that I tried some CQT/wavelet-based transforms and got similarly mediocre results, I think because they're much more periodic than a real/imag STFT.
Also, I've noticed the waterfall effect seems to continue even with STFT loss enabled in high-noise regions of the spectrogram, and it becomes more prominent the further you get into the training run. Could it be caused by errors being amplified between bands via the stft/istft operation at each step?
Also, I noticed the waterfall effect is still quite prevalent when there's noise but not complete silence even at the end of training with STFT loss enabled.
Hi @zaptrem
I also don't know how to fix the mismatch when shifting between bands of different FFT size.
Regarding the waterfall noise, I believe it occurs because the model attempts to reconstruct a phase even for background noise, where no meaningful phase is present. Increasing the STFT loss weight might resolve this issue by making the model place slightly more emphasis on the magnitude information.
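A weighted log-magnitude STFT loss along those lines might look like this sketch; the exact transform parameters and the L1-on-log-magnitude form are assumptions, not necessarily RFWave's implementation.

```python
import torch
import torch.nn.functional as F

def stft_mag_loss(pred, target, n_fft=1024, hop=256, weight=1.0):
    # Compare log magnitudes only, pushing the model toward the right energy
    # envelope without asking it to match phase in noise regions.
    win = torch.hann_window(n_fft, device=pred.device)
    P = torch.stft(pred, n_fft, hop_length=hop, window=win, return_complex=True).abs()
    T = torch.stft(target, n_fft, hop_length=hop, window=win, return_complex=True).abs()
    return weight * F.l1_loss(torch.log(P + 1e-5), torch.log(T + 1e-5))
```

Raising `weight` is the knob referred to above: it shifts the training balance toward magnitude accuracy and away from exact phase reconstruction in noise.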
I figured out you can use a PQMF to get clean cuts that you can apply different STFT settings to, but I haven't tried a model with it yet because I don't know how the aliasing cancellation will work when neither top-level band is aware of the other.
Something I noticed (though haven't rigorously confirmed) is that the waterfall effect is actually less prominent earlier in training. Also notice how the lines span between bands. Could small errors be getting amplified by the overlap between bands or possibly some sort of spectral bleeding during sampling? Did you see the waterfall effect in your experiments from before you started using STFT(ISTFT()) each step during sampling?
Edit: The waterfalling goes away entirely when I disable the stft(istft()) at inference time. However, the quality is otherwise worse:
Top: istft/stft turned off. Middle: Turned on. Bottom: Ground truth.
@accum-dai @Beronx86 Have you guys noticed this? It is possible solving this issue could help RFWave beat BigVGAN.
Hi @zaptrem, I apologize for not noticing that you updated your post. `stft(istft(x))` is added when the model supports conditioning higher bands on lower bands, and it is applied to a subband's complex spectrum. This is because the inverse short-time Fourier transform (istft) of a crafted spectrum may not yield a real signal, which causes error accumulation when taking the inverse as real and applying many sampling steps. When predicting all subbands in parallel, `stft(istft(x))` actually operates on the full-band spectrum, making it potentially unnecessary. However, I have not yet attempted to remove it. I appreciate that you tried it, and I will conduct further tests to explore this.
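In other words, the round trip acts as a consistency projection: an arbitrary complex array is generally not the STFT of any real signal, and `istft` followed by `stft` maps it back onto one that is. A minimal sketch (parameter names are illustrative):

```python
import torch

def project_consistent(spec, n_fft=1024, hop=256):
    # spec: (batch, freq_bins, frames), complex. istft forces a real signal;
    # re-applying stft yields a "consistent" spectrogram, avoiding the error
    # accumulation described above across many sampling steps.
    win = torch.hann_window(n_fft, device=spec.device)
    wav = torch.istft(spec, n_fft, hop_length=hop, window=win)
    return torch.stft(wav, n_fft, hop_length=hop, window=win, return_complex=True)
```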
Hi @zaptrem Is your dataset publicly available? I would like to test it.
It's a personal dataset so unfortunately can't share it. Closest approximation would be something like https://github.com/mdeff/fma plus https://github.com/MTG/mtg-jamendo-dataset.
@zaptrem Is it possible to share a few samples for testing, especially the one shown in the figure above?
I wish I could, but I'm not certain the license grants me permission to post it publicly. Here's a sample I made myself that should repro the issue
https://github.com/user-attachments/assets/d5d314c4-d7c0-4de2-9680-2e4a75978144
Using this checkpoint, I haven't reproduced your issue. Could you test it on your internal test case and update the result? @zaptrem

[Audio samples: gt (ground truth) and syn (synthesized).]