david-andrew / AudioDomainPitchDetector

Archive of a (failed) attempt to make an audio-domain pitch detector using the same process as the SPICE pitch detector model

Training SPICE #1

Open bayethiernodiop opened 2 years ago

bayethiernodiop commented 2 years ago

Did you train SPICE with your own data? If so, what implementation did you use? Is it open source?

david-andrew commented 2 years ago

> Did you train SPICE with your own data? If so, what implementation did you use? Is it open source?

If I recall correctly, I only tried to train my own model, not SPICE itself. For data, I believe I used the VocalSet dataset: https://zenodo.org/record/1193957

bayethiernodiop commented 2 years ago

Thanks for your answer. It looks like that data is for speaker/singer/tone recognition rather than pitch tracking: the input appears to be audio and the output a class label. Am I missing something? My goal is to predict the similarity between an original audio clip and an imitation using their pitch vectors. If you know another way of doing this, I would appreciate you sharing it. I tried the pretrained CREPE model, but the results are bad on my dataset (not music, just a cappella singing without instruments).

david-andrew commented 2 years ago

> Thanks for your answer. It looks like that data is for speaker/singer/tone recognition rather than pitch tracking: the input appears to be audio and the output a class label. Am I missing something? My goal is to predict the similarity between an original audio clip and an imitation using their pitch vectors. If you know another way of doing this, I would appreciate you sharing it. I tried the pretrained CREPE model, but the results are bad on my dataset (not music, just a cappella singing without instruments).

To use VocalSet for training in this context, you can ignore the labels, since you really just want examples of monophonic instruments; presumably any audio dataset containing solo voices or instruments would work. My understanding is that SPICE performs the pitch shifting automatically in the constant-Q domain, which is what lets it train in a self-supervised manner. For the model I was trying to develop in this repository, pitch shifting was instead a preprocessing step: I generated clips shifted by known amounts, using SBSMS to perform the shift in the audio domain rather than the frequency/constant-Q domain. A rough sketch of that idea follows.
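For illustration only, here is a minimal sketch of that pair-generation idea; it is not code from this repository. `librosa.effects.pitch_shift` stands in for SBSMS, and the function names, the shift range, and the loss are all hypothetical:

```python
# Minimal sketch (assumptions: librosa is installed; pitch_shift stands in
# for SBSMS, which the preprocessing described above actually used).
import numpy as np
import librosa

def make_shifted_pair(y, sr, max_shift=12.0, rng=None):
    """Return the clip, a pitch-shifted copy, and the known shift (semitones)."""
    rng = rng or np.random.default_rng()
    shift = float(rng.uniform(-max_shift, max_shift))
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=shift)
    return y, y_shifted, shift

def relative_pitch_loss(p_orig, p_shifted, shift):
    """SPICE-style self-supervised objective: the difference between the
    model's pitch estimates for the two clips (assumed to be in semitones)
    should equal the known shift."""
    return ((p_shifted - p_orig) - shift) ** 2
```

SPICE itself gets the same effect without touching the waveform: translating a constant-Q transform by k bins, with B bins per octave, corresponds to a pitch shift of 12k/B semitones, so the known shift comes for free from the translation amount.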

But to be clear, I haven't used SPICE myself; I only read the paper and used it as inspiration for my own model here.