anthonio9 / penn

Pitch Estimating Neural Networks (PENN)
MIT License

New Metric: Pitch accuracy when all strings are merged to one #8

Closed anthonio9 closed 4 months ago

anthonio9 commented 5 months ago

Poly Pitch Net models are doing quite well at pitch recognition (up to 82%); however, one of the main remaining problems is the distinction between the strings. Create a new metric that ignores the string distinction and merges all strings into one, then checks the pitch accuracy: RPA, RCA, RMSE. This new metric will tell how good the pitch recognition is overall, instead of how good the network is at pitch recognition and string separation at the same time.

anthonio9 commented 5 months ago

The new metric could be called Full RPA, Full RCA, Full RMSE; I'm not sure what's best. There are thresholds set for periodicity, and those thresholds are not that accurate either (60-80%); it could be because the network is simply better at recognizing the pitch than the string. To make these metrics happen, periodicity first has to be evaluated with all candidate thresholds - this is the way to know which threshold really behaves best when periodicity estimation is combined with pitch estimation:

  1. Get the pitch for all strings at all times.
  2. Get the periodicity estimation for all strings at all times.
  3. Marry the two and sort the results by pitch value.
  4. Sort the labels by pitch value.
  5. Finally, send the predicted and ground-truth data to the metrics class, or make a new class with new RPA, RCA and RMSE objects (the latter seems better).
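The merge-and-sort step above could be sketched roughly like this (the function name and the `[strings, frames]` tensor shapes are assumptions, not the actual penn API):

```python
import torch

def merge_strings(pitch, periodicity, target_pitch):
    """Sketch of steps 1-5: merge the per-string predictions by sorting.

    Assumed shapes: pitch, periodicity, target_pitch are [strings, frames].
    """
    # Steps 1-2: pitch and periodicity already cover all strings and frames.
    # Step 3: sort predictions per frame by pitch value, carrying
    # periodicity along with the same permutation.
    pred_sorted, order = torch.sort(pitch, dim=0)
    periodicity_sorted = torch.gather(periodicity, 0, order)
    # Step 4: sort the labels by pitch value as well, so strings line up.
    target_sorted, _ = torch.sort(target_pitch, dim=0)
    # Step 5: these sorted tensors would go to the new metrics class.
    return pred_sorted, periodicity_sorted, target_sorted
```

After sorting, prediction and label rows are aligned by pitch order rather than by string index, so string confusion no longer affects the metric.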
anthonio9 commented 5 months ago

This is now becoming priority NUMBER UNO; the hope is that this new metric will allow for a better translation of network performance into accuracy. IMPORTANT!

anthonio9 commented 5 months ago

MultiPitchMetrics() has to be called from the Metrics class and not from anywhere else. Keep that in mind!

anthonio9 commented 5 months ago

Plots are not ready for the pitch periodicity information. I'm a bit worried that there is something wrong with the periodicity estimation based on the threshold, but let's see how the training and evaluation goes.

anthonio9 commented 4 months ago

Make the new metric plot multiple threshold outputs on one plot with wandb. Here's a bit on how to do so.

anthonio9 commented 4 months ago

Periodicity does work well; it's now clear that the network struggles tremendously with recognizing the source string. The next step is to fix the metric, because right now it seems to output rubbish.

anthonio9 commented 4 months ago

There's a need for a metric that can tell whether the number of present notes was predicted correctly. The metric can either say how close the predicted note count is, or how often it is exactly right.
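Both variants could be sketched like this (the function name, the `[strings, frames]` shapes, and zero marking unvoiced frames are all assumptions):

```python
import torch

def note_count_metrics(pred_pitch, target_pitch, unvoiced=0.0):
    """Two hypothetical note-count metrics.

    Assumed shapes: [strings, frames]; `unvoiced` marks silent frames.
    """
    # Number of active notes per frame, predicted vs. ground truth
    pred_counts = (pred_pitch != unvoiced).sum(dim=0).float()
    true_counts = (target_pitch != unvoiced).sum(dim=0).float()
    # Variant 1: how close the predicted count is on average
    mean_count_error = (pred_counts - true_counts).abs().mean()
    # Variant 2: how often the count is exactly right
    exact_match_rate = (pred_counts == true_counts).float().mean()
    return mean_count_error.item(), exact_match_rate.item()
```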

anthonio9 commented 4 months ago

RCA, RPA and RMSE require pitch in cents; however, when the pitch can contain zero values, that conversion is not directly possible. Either compare only the values that contain pitch, or set the 0s to penn.FMIN before converting to cents. The new metric could, for example, subtract predicted values row by row from the target values and take the minimum difference as the metric. This would tell how good the prediction is regardless of the order of the target or predicted values. Apparently both sets of data (predicted and true) are a bit flawed, and not all notes are always present in the predictions.
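A sketch of the clamp-to-FMIN conversion plus the row-wise minimum-difference idea; the `FMIN` constant here is a made-up stand-in for `penn.FMIN`, and the function names are hypothetical:

```python
import torch

FMIN = 31.0  # assumed stand-in for penn.FMIN (Hz)

def cents(freq):
    # Clamp unvoiced (0 Hz) frames up to FMIN so log2 is defined,
    # then convert to cents relative to FMIN
    return 1200 * torch.log2(torch.clamp(freq, min=FMIN) / FMIN)

def min_difference_metric(pred_pitch, target_pitch):
    """Order-free metric sketch: for each target note, take the smallest
    absolute cents difference to any predicted note, then average.

    Assumed shapes: [strings, frames].
    """
    pred_c = cents(pred_pitch)
    target_c = cents(target_pitch)
    # Pairwise differences: [target_strings, pred_strings, frames]
    diff = (target_c.unsqueeze(1) - pred_c.unsqueeze(0)).abs()
    # Best match per target note, averaged over notes and frames
    return diff.min(dim=1).values.mean()
```

If the predicted rows are a permutation of the target rows, the metric is 0 regardless of string order, which is exactly the order-insensitivity wanted here.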

anthonio9 commented 4 months ago

Use this as an example:

(image attachment)

anthonio9 commented 4 months ago

The above example covers exactly the range of 4.8-5.3 seconds.

anthonio9 commented 4 months ago

The simple metric shows that in over 90% of cases the network does recognize the correct pitch on at least one of the strings, whenever the ground truth shows voiced data on a given string.
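The "at least one string" check could look roughly like this (function name, shapes, and the 50-cent tolerance are assumptions; inputs are already in cents):

```python
import torch

def any_string_accuracy(pred_cents, target_cents, voiced, threshold=50):
    """Sketch of the simple metric: a voiced frame counts as correct if
    ANY predicted string is within `threshold` cents of its target pitch.

    Assumed shapes: [strings, frames]; `voiced` is a boolean mask.
    """
    # Pairwise differences: [target_strings, pred_strings, frames]
    diff = (target_cents.unsqueeze(1) - pred_cents.unsqueeze(0)).abs()
    # Best-matching predicted string per target string and frame
    hit = diff.min(dim=1).values < threshold
    # Average only over frames the ground truth marks as voiced
    return hit[voiced].float().mean()
```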

Train set: (image attachment)

Validation set: (image attachment)

Having read the charts, the conclusion is that the entropy-based function estimating the periodicity of the pitch does not really work well on a per-string basis. The pitch part of the network is pretty good as long as the string confusion is ignored; periodicity is not reliable at all. The highest accuracy values appear when the periodicity values are completely ignored.
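For reference, an entropy-based periodicity of the kind discussed here is usually computed from the pitch posteriorgram: a peaked distribution (low entropy) means high periodicity. A minimal sketch, assuming `logits` has shape `[bins, frames]` (not necessarily penn's exact formulation):

```python
import torch

def entropy_periodicity(logits):
    """Periodicity as 1 minus the normalized entropy of the pitch
    posteriorgram. Assumed shape: logits is [bins, frames]."""
    distribution = torch.softmax(logits, dim=0)
    # Per-frame entropy in bits; small epsilon guards log2(0)
    entropy = -(distribution * torch.log2(distribution + 1e-8)).sum(dim=0)
    bins = logits.shape[0]
    # Normalize by the maximum possible entropy and invert
    return 1 - entropy / torch.log2(torch.tensor(float(bins)))
```

A sharply peaked posteriorgram yields a value near 1, a uniform one near 0; the per-string failure mode described above would show up as flat per-string distributions even on voiced frames.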

With the new metric the generalization gap between the training and validation sets is much lower. It used to be around 15-20%; now it is below 5% for the model evaluated at the checkpoint with the highest accuracy. Training FRCA2 is around 95% and validation FRCA2 around 92%. Great results, one could say!

anthonio9 commented 4 months ago

Test FRMSE2 seems very promising: on the test example from above it showed around 2, that's really good!

..and indeed, after a few iterations on a trained network the results are exceptional: below 100 RMSE, which never consistently happened before. Sadly, again, the lowest RMSE values are recorded when the periodicity threshold is almost completely ignored:

(image attachment)

Both ppn-split-voiced and ppn-split-rm-d4-t10 got down to below 70 RMSE.

anthonio9 commented 4 months ago

Closing the issue; metrics FRMSE2 and FRCA2 are now ready.