Open JoshVarty opened 5 years ago
The window size for computing spectra trades temporal resolution (short windows) against frequential resolution (long windows). Both for log-mel and constant-Q spectra, it is possible to use shorter windows for higher frequencies, but this results in inhomogeneously blurred spectrograms unsuitable for spatially local models. Alternatives include computing spectra with different window lengths, projected down to the same frequency bands, and treated as separate channels [16]. In [17] the authors also investigated combinations of different spectral features.
I think there are a few experiments here worth carrying out:
- Compute spectrograms with different window lengths (`n_fft`) and stack them together into RGB images
  - `xresnet18`: 0.8409990
  - `xresnet18`: 0.8438536
  - `xresnet18`: 0.8428597
  - `xresnet18` with 3 channels: 0.845929
  - `xresnet18` with 3 channels: 0.844033
  - `xresnet18` with 3 channels: 0.846719
Looks like there's a marginal increase of roughly 0.003 on average with the 3-channel input.
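For reference, here is a minimal sketch of the stacking idea: log-mel spectrograms computed with three window lengths, projected onto the same mel bands, and stacked as the channels of one image. It is not the exact pipeline used for the runs above. The function names, the window lengths (1024, 2048, 4096), the hop length, the number of mel bands, and the HTK-style mel formula are all assumptions chosen for illustration, and it uses plain NumPy instead of librosa to stay self-contained.

```python
import numpy as np


def hz_to_mel(f):
    # HTK-style mel scale (an assumption; librosa defaults to the Slaney variant).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (fft_freqs[None, :] - lower) / (center - lower)
    falling = (upper - fft_freqs[None, :]) / (upper - center)
    return np.clip(np.minimum(rising, falling), 0.0, None)


def log_mel(y, sr, n_fft, hop_length, n_mels):
    """Log-mel spectrogram in dB, shape (n_mels, n_frames)."""
    # Centre-pad so the frame count depends only on hop_length, not on n_fft.
    padded = np.pad(y, n_fft // 2, mode="reflect")
    n_frames = 1 + len(y) // hop_length
    idx = np.arange(n_fft)[None, :] + hop_length * np.arange(n_frames)[:, None]
    frames = padded[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = mel_filterbank(sr, n_fft, n_mels) @ power.T
    return 10.0 * np.log10(np.maximum(mel, 1e-10))


def stack_multi_window(y, sr, n_ffts=(1024, 2048, 4096), hop_length=512, n_mels=64):
    """One channel per window length, each scaled to [0, 1]."""
    image = np.stack([log_mel(y, sr, n, hop_length, n_mels) for n in n_ffts], axis=0)
    lo = image.min(axis=(1, 2), keepdims=True)
    hi = image.max(axis=(1, 2), keepdims=True)
    return (image - lo) / np.maximum(hi - lo, 1e-8)


# Synthetic one-second test signal: two sine tones.
sr = 22050
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440.0 * t) + 0.5 * np.sin(2 * np.pi * 1760.0 * t)
image = stack_multi_window(y, sr)
print(image.shape)  # (channels, n_mels, n_frames)
```

Because every channel shares the same hop length and mel bands, the three spectrograms line up pixel for pixel. The short window gives the sharpest onsets and the long window the sharpest frequency lines.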
It sounds like other features may help improve our model's ability to distinguish between sounds.
From: https://www.kaggle.com/c/freesound-audio-tagging-2019/discussion/93337#latest-537350
Paper that describes some of this: https://arxiv.org/pdf/1905.00078.pdf
Here's one such feature that might be worth exploring: https://librosa.github.io/librosa/generated/librosa.core.cqt.html