JoshVarty / AudioTagging

Working on: https://www.kaggle.com/c/freesound-audio-tagging-2019
MIT License

Consider incorporating other representations of sound into our model #38

Open JoshVarty opened 5 years ago

JoshVarty commented 5 years ago

It sounds like other features may help improve our model's ability to distinguish between sounds.

From: https://www.kaggle.com/c/freesound-audio-tagging-2019/discussion/93337#latest-537350 [image]

Paper that describes some of this: https://arxiv.org/pdf/1905.00078.pdf

B. Audio Features, '… However, due to the physics of sound production, there are additional correlations for frequencies that are multiples of the same base frequency (harmonics). To allow a spatially local model (e.g., a CNN) to take these into account, a third dimension can be added that directly yields the magnitudes of the harmonic series [14], [15].'

Here's one such feature that might be worth exploring: https://librosa.github.io/librosa/generated/librosa.core.cqt.html

JoshVarty commented 5 years ago

The same section continues: 'The window size for computing spectra trades temporal resolution (short windows) against frequential resolution (long windows). Both for log-mel and constant-Q spectra, it is possible to use shorter windows for higher frequencies, but this results in inhomogeneously blurred spectrograms unsuitable for spatially local models. Alternatives include computing spectra with different window lengths, projected down to the same frequency bands, and treated as separate channels [16]. In [17] the authors also investigated combinations of different spectral features.'
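
To put numbers on that trade-off, here is a small sketch that prints the window duration and FFT bin spacing for a few window sizes. The 44.1 kHz sample rate and the list of window sizes are assumptions for illustration; `resolution` is a hypothetical helper, not something from this repo.

```python
def resolution(sr, n_fft):
    """Return (window length in ms, FFT bin spacing in Hz) for a given n_fft."""
    window_ms = 1000.0 * n_fft / sr   # time span covered by one analysis window
    bin_hz = sr / n_fft               # spacing between adjacent FFT bins
    return window_ms, bin_hz

sr = 44100  # assumed sample rate in Hz

for n_fft in (512, 1024, 2048, 4096):
    window_ms, bin_hz = resolution(sr, n_fft)
    print(f"n_fft={n_fft:5d}  window={window_ms:6.1f} ms  bin spacing={bin_hz:6.2f} Hz")
```

Doubling `n_fft` halves the bin spacing (finer frequency detail) but doubles the time span each frame averages over.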

JoshVarty commented 5 years ago

I think there are a few experiments here worth carrying out:

  1. Create log-mel spectrograms with 3 different window sizes (n_fft) and stack them into a single 3-channel (RGB-style) image
  2. Try stacking other representations with the log-mel spectrogram. The Constant-Q Transform seems to be a popular one. The Continuous Wavelet Transform is another. For more information see: https://arxiv.org/pdf/1706.07156.pdf

JoshVarty commented 5 years ago

xresnet18: 0.8409990
xresnet18: 0.8438536
xresnet18: 0.8428597
xresnet18 with 3 channels: 0.845929
xresnet18 with 3 channels: 0.844033
xresnet18 with 3 channels: 0.846719

Looks like there's a marginal increase of about 0.003 on average.
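
For reference, the averages behind that estimate can be reproduced from the scores above with a few lines (the variable names are just for illustration):

```python
from statistics import mean

# Scores copied from the runs listed above.
baseline = [0.8409990, 0.8438536, 0.8428597]
three_channel = [0.845929, 0.844033, 0.846719]

gain = mean(three_channel) - mean(baseline)
print(f"baseline mean:  {mean(baseline):.4f}")       # 0.8426
print(f"3-channel mean: {mean(three_channel):.4f}")  # 0.8456
print(f"difference:     {gain:+.4f}")                # +0.0030
```

The worst 3-channel run (0.844033) is still slightly above the best baseline run (0.8438536), though with only three runs each this is suggestive rather than conclusive.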