Currently, we compute the cross-correlation between time-domain key waveforms to determine how similar two keys are. Instead, we could compute the similarity metric over the Mel spectrograms of the signals. The Mel spectrogram seems to be the go-to audio representation in modern state-of-the-art speech recognition algorithms, so why not give it a try in keytap?
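To make the idea concrete, here is a minimal sketch of a similarity metric over spectrograms rather than raw waveforms. It assumes both key presses have already been converted to log-Mel spectrograms of the same shape and flattened into vectors. The function name `spectrogram_similarity` is hypothetical and is not part of keytap; the metric shown is a zero-lag, zero-mean normalized correlation, which is one possible choice and not necessarily what keytap would end up using.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from keytap): zero-mean normalized correlation
// between two equally sized feature vectors, e.g. flattened log-Mel
// spectrograms. The result is in [-1, 1]. Returns 0 if the sizes differ,
// the inputs are empty, or either input is constant.
double spectrogram_similarity(const std::vector<float> & a,
                              const std::vector<float> & b) {
    if (a.size() != b.size() || a.empty()) {
        return 0.0;
    }

    // Remove the mean so that overall loudness offsets do not dominate.
    double mean_a = 0.0;
    double mean_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        mean_a += a[i];
        mean_b += b[i];
    }
    mean_a /= a.size();
    mean_b /= b.size();

    double dot = 0.0;
    double norm_a = 0.0;
    double norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double da = a[i] - mean_a;
        const double db = b[i] - mean_b;
        dot    += da * db;
        norm_a += da * da;
        norm_b += db * db;
    }

    if (norm_a == 0.0 || norm_b == 0.0) {
        return 0.0;
    }
    return dot / std::sqrt(norm_a * norm_b);
}
```

A time-shift search, like the one used for the waveform cross-correlation, could be layered on top by sliding one spectrogram along the frame axis and keeping the best score.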
Here is a sample implementation that computes the log-scaled Mel spectrogram of an audio signal, which I recently wrote for the whisper.cpp project: https://github.com/ggerganov/whisper.cpp/blob/6d654d192a62e6cd9897d6ff683bdc97406827e9/main.cpp#L1962-L2063
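For readers who do not want to open the link, the two ingredients that make a spectrogram "log-scaled Mel" can be sketched in a few lines. This is an illustrative sketch and not a copy of the linked code: the helper names (`hz_to_mel`, `mel_to_hz`, `mel_band_edges`, `log_mel`) are hypothetical, the Hz-to-Mel formula is one common variant, and the clamp value of 1e-10 is an assumption. The FFT and the triangular filter weights themselves are omitted.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// One common Hz <-> Mel mapping (assumed here; other variants exist).
double hz_to_mel(double hz)  { return 2595.0 * std::log10(1.0 + hz / 700.0); }
double mel_to_hz(double mel) { return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0); }

// Hypothetical helper: boundary frequencies (in Hz) of n_mel triangular
// filters, evenly spaced on the Mel scale between f_min and f_max.
// Returns n_mel + 2 points; filter k spans points k .. k + 2.
std::vector<double> mel_band_edges(int n_mel, double f_min, double f_max) {
    std::vector<double> edges(n_mel + 2);
    const double m_min = hz_to_mel(f_min);
    const double m_max = hz_to_mel(f_max);
    for (int i = 0; i < n_mel + 2; ++i) {
        edges[i] = mel_to_hz(m_min + (m_max - m_min) * i / (n_mel + 1));
    }
    return edges;
}

// Log-scale a Mel-band energy, clamping small values to avoid log10(0).
double log_mel(double energy) {
    return std::log10(std::max(energy, 1e-10));
}
```

Because the band edges are evenly spaced in Mel rather than in Hz, the filters are narrow at low frequencies and wide at high frequencies, which is the main difference from a plain linear-frequency spectrogram.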