[Idea] Compute key similarity over the log-scale Mel spectrogram

Currently, we compute the cross-correlation between time-domain key waveforms to determine how similar two keys are. Instead, we could compute the similarity metric over the Mel spectrograms of the signals. The Mel spectrogram seems to be the go-to audio representation in modern state-of-the-art speech recognition algorithms, so why not give it a try in keytap?

Here is a sample implementation that computes the log-scaled Mel spectrogram of an audio signal, which I recently wrote for the whisper.cpp project:

https://github.com/ggerganov/whisper.cpp/blob/6d654d192a62e6cd9897d6ff683bdc97406827e9/main.cpp#L1962-L2063
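To make the idea concrete, here is a minimal Python sketch of what "similarity over the log-Mel spectrogram" could look like. It is neither the whisper.cpp code linked above nor keytap's current implementation. All function names, the frame/hop/filter sizes, the 16 kHz sample rate, and the choice of a zero-lag normalized correlation are assumptions made for illustration. It uses a naive DFT to stay dependency-free, so it is only suitable for short key waveforms.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the Mel scale (one row per filter)."""
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies: each filter spans three consecutive edges.
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for k in range(n_fft // 2 + 1):
            f = k * sample_rate / n_fft
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

def log_mel_spectrogram(samples, sample_rate=16000, n_fft=128, hop=32, n_mels=16):
    """Return a list of frames, each a list of n_mels log10 Mel energies."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n_fft) for i in range(n_fft)]
    frames = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(n_fft)]
        power = []
        for k in range(n_fft // 2 + 1):
            # Naive DFT bin (O(n_fft) per bin); a real implementation would use an FFT.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                      for n in range(n_fft))
            power.append(abs(acc) ** 2)
        # Floor the energy before the log to avoid log10(0).
        frames.append([math.log10(max(sum(w * p for w, p in zip(row, power)), 1e-10))
                       for row in bank])
    return frames

def similarity(spec_a, spec_b):
    """Zero-lag normalized correlation between two equally sized spectrograms."""
    x = [v for frame in spec_a for v in frame]
    y = [v for frame in spec_b for v in frame]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    x = [v - mean_x for v in x]
    y = [v - mean_y for v in y]
    denom = math.sqrt(sum(v * v for v in x) * sum(v * v for v in y))
    return sum(p * q for p, q in zip(x, y)) / denom if denom > 0.0 else 0.0

def decaying_tone(freq, sample_rate=16000, n=1024, gain=1.0):
    """Synthetic stand-in for a key press: an exponentially decaying sine."""
    return [gain * math.exp(-i / 200.0) * math.sin(2.0 * math.pi * freq * i / sample_rate)
            for i in range(n)]
```

One property worth noting: because the features are logarithmic and the correlation is mean-subtracted, a uniform change in loudness only adds a constant offset to the spectrogram and barely affects the score. The sketch compares at zero lag only, so the waveforms would need to be aligned beforehand, or the comparison repeated over a range of frame offsets.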
