Our data comes from a variety of sources, each at a different sampling rate.
Right now our ETL pipeline converts it all to 16 kHz, 16-bit signed PCM waveforms.
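For reference, a minimal sketch of that conversion step, assuming sox is the underlying tool (the pipeline's actual implementation may differ):

```python
import subprocess

def to_16k_pcm(src: str, dst: str) -> None:
    """Hypothetical wrapper: convert any input to 16 kHz, 16-bit signed PCM.

    sox resamples automatically when the output rate (-r) differs from the input's.
    """
    subprocess.run(
        ["sox", src, "-r", "16000", "-b", "16", "-e", "signed-integer", dst],
        check=True,
    )
```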
archive.org has a "bitrate" field, but it is an arbitrary StringType with no schema. For example, a 44.1 kHz audio file may have its bitrate field set to null, "", "44.1KHz", "44.1 KHz", "44100Hz", or "44100 Hz".
So clearly we cannot rely on that field for archive.org. And since data also comes from sources other than archive.org, building a fuzzy parser for it wouldn't be worth the effort anyway.
The "soxi" utility allows us to inspect sampling rate of any particular input file. This should be fairly reliable.
However, it's possible that the source of the audio upsampled the data. For example, a file may claim 44.1 kHz but have been recorded on hardware that only supported a 22.05 kHz sampling rate.
It may therefore be worthwhile to try to detect the "original" sampling rate of each audio file. A straightforward approach: take the FFT of each file, measure the power in particular frequency bands, and apply hand-coded decision rules to classify into one of the standard rates (8 kHz, 16 kHz, 22.05 kHz, 44.1 kHz, 48 kHz; I don't think there are any other meaningful sampling rates in audio land). This may fail, however, if some files are unusually quiet. The problem is essentially an instance of https://en.wikipedia.org/wiki/Ordinal_regression, though I'm not quite sure machine learning is the right approach here.
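As a rough sketch of what those decision rules could look like (assumptions: inputs are decodable with the soundfile library, and the -50 dB cutoff is an invented threshold that would need tuning; as noted above, unusually quiet files may defeat it):

```python
import numpy as np
import soundfile as sf  # assumption: inputs are in a format soundfile can decode

# Candidate "original" rates, ascending; the file's declared rate is the fallback.
CANDIDATE_RATES = [8000, 16000, 22050, 44100, 48000]

def estimate_original_rate(path: str, threshold_db: float = -50.0) -> int:
    audio, rate = sf.read(path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)  # mix down to mono
    power = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / rate)
    total = power.sum() + 1e-12
    for candidate in CANDIDATE_RATES:
        nyquist = candidate / 2
        if nyquist >= rate / 2:
            break  # can't see above this file's own Nyquist, so stop here
        # Fraction of total spectral power above the candidate's Nyquist.
        above_db = 10 * np.log10(power[freqs > nyquist].sum() / total + 1e-12)
        if above_db < threshold_db:
            # Essentially no energy above this band edge: the content was
            # band-limited there, so it was likely upsampled from this rate.
            return candidate
    return rate  # no evidence of upsampling; trust the declared rate
```

Averaging the spectrum over windows (e.g. Welch's method) would likely be more robust than a single full-length FFT, especially for long files.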