Closed vroger11 closed 1 year ago
hmmm... MIR-1K is at 16kHz too, and anyway everything is converted to a CQT beforehand, so the sampling rate should have minimal impact. Could you indicate which step size you use and possibly provide an audio example (a filename from NSynth is fine) that does not work? Also, do you get one prediction per frame, or do you average them somehow?
Here is a test with 3 examples. I convert the output frequencies to MIDI (to compare with the NSynth ground truth) and use the 30 highest-confidence frames to predict the pitch (by majority vote rather than by mean).
```python
import math

import librosa
import numpy as np
import soundfile as sf
import torch

import pesto


def frequency_to_midi(frequency):
    midi_note = 69 + 12 * math.log2(frequency / 440)
    return int(round(midi_note))


def pesto_predict(file, n_best=30, step_size=10.):
    x, sr = sf.read(file, dtype=np.float32)
    x = torch.tensor(x)
    # trim leading/trailing silence before predicting
    x_t, _ = librosa.effects.trim(x)
    _, pitch, confidence, _ = pesto.predict(x_t.reshape(1, len(x_t)), sr, step_size=step_size)
    # keep the n_best most confident frames and vote on the pitch
    temp = np.argpartition(-confidence, n_best)
    unique_values, counts = np.unique(pitch[temp[:n_best]], return_counts=True)
    return frequency_to_midi(unique_values[counts.argmax()])


example_1 = "nsynth-test/audio/bass_electronic_018-029-025.wav"
example_2 = "nsynth-test/audio/guitar_acoustic_010-063-100.wav"
example_3 = "nsynth-test/audio/mallet_acoustic_047-066-100.wav"

print(f"example 1; expected 29, got {pesto_predict(example_1)}")
print(f"example 2; expected 63, got {pesto_predict(example_2)}")
print(f"example 3; expected 66, got {pesto_predict(example_3)}")
```
The results I got:

```
example 1; expected 29, got 22
example 2; expected 63, got 35
example 3; expected 66, got 36
```
Hi!
Yeah, sorry, it's a bit confusing: the CLI converts predictions to frequencies by default, whereas the Python API `pesto.predict` has `convert_to_freq=False` by default and returns MIDI pitches, so in your example you treated MIDI notes as frequencies. If I convert your results back to frequencies, I get (up to rounding errors) the expected result ;)
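A quick way to verify this reading (a small sketch, reusing the `frequency_to_midi` helper from the snippet above): feeding the *expected MIDI notes* through the Hz-to-MIDI formula reproduces the reported outputs exactly, which is what you'd see if the values were already MIDI.

```python
import math

def frequency_to_midi(frequency):
    # same Hz-to-MIDI formula as in the snippet above
    return int(round(69 + 12 * math.log2(frequency / 440)))

# Treating the expected MIDI notes as if they were frequencies in Hz
# reproduces the observed values from the three examples:
for expected in (29, 63, 66):
    print(expected, "->", frequency_to_midi(expected))
# 29 -> 22, 63 -> 35, 66 -> 36
```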
I'll make this clearer in the docs of the next version. In the meantime, you can just remove the `frequency_to_midi` call in your function and, if needed, replace it with `np.round` to get an integer output.
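Concretely, the vote can operate on the MIDI pitches directly. Here is a minimal self-contained sketch of that corrected selection step; the toy `pitch` and `confidence` arrays below are stand-ins for what `pesto.predict` returns (with `convert_to_freq` left at its default, `pitch` is already in MIDI):

```python
import numpy as np

def pick_pitch(pitch, confidence, n_best=30):
    # vote among the n_best most confident frames, rounding MIDI pitches
    # to integers instead of re-converting them with frequency_to_midi
    idx = np.argpartition(-confidence, n_best)[:n_best]
    values, counts = np.unique(np.round(pitch[idx]).astype(int), return_counts=True)
    return values[counts.argmax()]

# toy data: 100 frames around MIDI note 63 with random confidences
rng = np.random.default_rng(0)
pitch = 63.2 + rng.normal(0, 0.1, 100)
confidence = rng.random(100)
print(pick_pitch(pitch, confidence))  # 63
```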
I tested the model on samples from NSynth and the results are not as expected: the f0 is underestimated. Could this be due to the 16 kHz sampling rate?