SonyCSLParis / pesto

Self-supervised learning for fast pitch estimation
GNU Lesser General Public License v3.0

Underestimation of f0 #13

Closed vroger11 closed 1 year ago

vroger11 commented 1 year ago

I tested the model on samples from NSynth and the results are not what I expected: the f0 is underestimated. Could this be because the sampling rate is 16 kHz?

aRI0U commented 1 year ago

Hmm... MIR-1K is at 16 kHz too, and in any case everything is converted to a CQT beforehand, so the sampling rate should have minimal impact. Could you indicate which step size you use and, if possible, provide an audio example (a filename from NSynth is fine) that does not work? Also, do you get one prediction per frame, or do you average them somehow?

vroger11 commented 1 year ago

Here is a test with 3 examples. I convert the output frequencies to MIDI (to compare with the NSynth ground truth). I use the 30 most confident predictions to pick the pitch (by majority vote rather than by averaging).

import pesto
import torch
import soundfile as sf
import numpy as np
import math
import librosa

def frequency_to_midi(frequency):
    midi_note = 69 + 12 * math.log2(frequency / 440)
    return int(round(midi_note))

def pesto_predict(file, n_best=30, step_size=10.):
    x, sr = sf.read(file, dtype=np.float32)
    x = torch.tensor(x)
    x_t, _ = librosa.effects.trim(x)

    _, pitch, confidence, _ = pesto.predict(x_t.reshape(1, len(x_t)), sr, step_size=step_size)
    temp = np.argpartition(-confidence, n_best)
    unique_values, counts = np.unique(pitch[temp[:n_best]], return_counts=True)

    return frequency_to_midi(unique_values[counts.argmax()])

example_1 = "nsynth-test/audio/bass_electronic_018-029-025.wav"
example_2 = "nsynth-test/audio/guitar_acoustic_010-063-100.wav"
example_3 = "nsynth-test/audio/mallet_acoustic_047-066-100.wav"
print(f"example 1; expected 29, got {pesto_predict(example_1)}")
print(f"example 2; expected 63, got {pesto_predict(example_2)}")
print(f"example 3; expected 66, got {pesto_predict(example_3)}")

The results I got:

example 1; expected 29, got 22
example 2; expected 63, got 35
example 3; expected 66, got 36

aRI0U commented 1 year ago

Hi! Yeah, sorry, it's a bit confusing: the CLI converts to frequencies by default, whereas in the Python API pesto.predict has a parameter convert_to_freq that is False by default, so in your example you treated MIDI notes as frequencies. If I convert your results back to frequencies, I get (up to rounding errors) the expected result ;)
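To make the mix-up concrete, here is a minimal sketch (not part of the original reply) that reuses the frequency_to_midi formula from the script above. It assumes the model's MIDI output was close to the NSynth ground truth, and the midi_to_frequency helper is added here purely for illustration.

```python
import math

def frequency_to_midi(frequency):
    # Same conversion as in the script above: Hz -> nearest MIDI note number.
    return int(round(69 + 12 * math.log2(frequency / 440)))

def midi_to_frequency(midi_note):
    # Illustrative inverse: MIDI note number -> Hz (A4 = MIDI 69 = 440 Hz).
    return 440 * 2 ** ((midi_note - 69) / 12)

# NSynth ground-truth MIDI notes for the three examples.
expected_midi = [29, 63, 66]

# Assumption: the model returned values close to these MIDI notes.
# Passing them to frequency_to_midi as if they were Hz gives the reported output.
mistaken = [frequency_to_midi(m) for m in expected_midi]
print(mistaken)  # [22, 35, 36]

# Undoing the extra conversion: read the reported values as MIDI notes,
# convert them to Hz, and the numbers land near the ground-truth MIDI notes.
recovered = [midi_to_frequency(m) for m in (22, 35, 36)]
print([round(f, 1) for f in recovered])  # [29.1, 61.7, 65.4]
```

The recovered values differ slightly from 29, 63 and 66 because the script rounded to an integer before printing, which is the rounding error mentioned above.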

aRI0U commented 1 year ago

I'll make this clearer in the docs of the next version. In the meantime, you can just remove the frequency_to_midi call in your function and, if you need an integer as output, replace it with np.round.
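A possible shape for the corrected voting step is sketched below, assuming the pitch values are already in MIDI units. The vote_midi helper and the synthetic frame values are hypothetical, and the sketch uses only the standard library so it runs without the model. It also rounds before voting, since raw float pitches rarely repeat exactly.

```python
from collections import Counter

def vote_midi(pitch, confidence, n_best=30):
    """Most common rounded MIDI note among the n_best most confident frames.

    Hypothetical helper: pitch is assumed to be non-empty and already in
    MIDI (semitone) units, so no Hz-to-MIDI conversion is applied.
    """
    n_best = min(n_best, len(pitch))
    # Frame indices sorted by decreasing confidence.
    order = sorted(range(len(pitch)), key=lambda i: confidence[i], reverse=True)
    # Round before voting so that nearby float values count as the same note.
    votes = Counter(int(round(float(pitch[i]))) for i in order[:n_best])
    return votes.most_common(1)[0][0]

# Synthetic frames standing in for model output (made-up values).
pitch = [28.9, 29.1, 29.0, 41.2, 29.2]
confidence = [0.9, 0.8, 0.95, 0.1, 0.85]
print(vote_midi(pitch, confidence, n_best=4))  # 29
```

With the real model, pitch and confidence would come from pesto.predict as in the script above, left at its default convert_to_freq=False. If those outputs are PyTorch tensors with a batch dimension, they would likely need flattening first, for example with .squeeze().tolist().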