Open xdcesc opened 5 years ago
We tend to use WPE together with other components, e.g. beamforming. When doing so, we use parameters typical for that application.
In these examples [1, 2] we use a window size of 512. But we tend to check various sizes and shifts when performance is important.
In [2] we use it together with a beamformer. Since a size of 1024 and a shift of 256 worked better for beamforming on this dataset, we used these parameters. It's worth noting that all other parameters (minimum delay, ...) should ideally be checked as well, e.g. on the development set.
[1] https://groups.uni-paderborn.de/nt/pubs/2018/IWAENC_2018_Heymann_Paper.pdf
[2] https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8683294
[3] https://groups.uni-paderborn.de/nt/pubs/2018/INTERSPEECH_2018_Drude_Paper.pdf
@LukasDrude Thanks for your reply. I ran some simulations with different echo lengths and DFT sizes. It is true that we need to check various DFT sizes to get optimal performance; for example, for an 800 ms echo, the best DFT window size is 1024. What confuses me is that using a 2048-point DFT makes it worse. Considering the coherence bandwidth of the room impulse response, a larger DFT window size should not lead to performance degradation.
@xdcesc I definitely recommend not tuning the DFT size for each single utterance. We tend to set the parameters on the train or validation set and then keep those values for the test set.
In general, several effects interact when you choose the DFT size. If your DFT size is very large, WPE has very few time frames from which to estimate the covariance matrix. You get a high frequency resolution, but that does not really help when the algorithm produces inaccurate estimates.
Also keep in mind that when you change the DFT size you basically have to retune all other parameters as well (e.g. the minimum delay, ...).
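To make the trade-off between frame count and frequency resolution concrete, here is a minimal sketch that counts STFT frames and frequency bins for a few window sizes. The 16 kHz sample rate, the 4 s utterance length, the shift of 512 for the 2048 window, and the helper name `stft_shape` are assumptions for illustration only. The count assumes no padding, so a real STFT implementation may differ by a few frames.

```python
def stft_shape(num_samples, size, shift):
    """Return (num_frames, num_bins) for an STFT without padding.

    Hypothetical helper for illustration; not part of any WPE library.
    """
    # Number of complete windows that fit into the signal.
    num_frames = 1 + (num_samples - size) // shift
    # Number of non-redundant frequency bins of a real-valued signal.
    num_bins = size // 2 + 1
    return num_frames, num_bins


sample_rate = 16000            # assumed sample rate in Hz
num_samples = 4 * sample_rate  # assumed 4 s utterance

for size, shift in [(512, 128), (1024, 256), (2048, 512)]:
    frames, bins = stft_shape(num_samples, size, shift)
    print(f"size={size:4d} shift={shift:3d} -> {frames:3d} frames, {bins:4d} bins")
```

Under these assumptions, going from 512 to 2048 cuts the number of frames available per frequency bin to roughly a quarter, while the number of bins roughly quadruples.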
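One possible starting point for retuning, sketched below, is to keep the delay roughly constant in milliseconds and convert it to frames for the new shift. This is only a heuristic assumption, not a rule stated in this thread; the helper name `frames_for_duration`, the 16 kHz sample rate, and the example durations are hypothetical, and the result should still be validated on the development set.

```python
def frames_for_duration(duration_ms, shift, sample_rate=16000):
    """Convert a duration in ms to a frame count for a given STFT shift.

    Hypothetical helper; the sample rate default is an assumption.
    Always returns at least one frame.
    """
    samples = duration_ms * sample_rate / 1000
    return max(1, round(samples / shift))


# A 24 ms delay corresponds to 3 frames at a shift of 128 samples (8 ms per frame).
delay_small_shift = frames_for_duration(24, shift=128)

# A 32 ms delay corresponds to 2 frames at a shift of 256 samples (16 ms per frame).
delay_large_shift = frames_for_duration(32, shift=256)

print(delay_small_shift, delay_large_shift)
```

Because frame counts are integers, a larger shift gives coarser control over the delay, which is one more reason to re-check it after changing the DFT size.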
@LukasDrude Could you please explain why you chose an STFT size of 512 (with a shift of 128)? Is it related to the coherence bandwidth of the RIR?