Poly Pitch Net models are doing quite well with pitch recognition (up to 82%), but one of the main remaining problems is the distinction between the strings. Create a new metric that ignores the string distinction and merges all strings into one, then check the pitch accuracy: RPA, RCA, RMSE. This new metric will tell how good the pitch recognition is overall, instead of how good the network is at pitch recognition and string separation at the same time.
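A minimal sketch of what such a merged-strings metric could look like, assuming per-string pitch tensors of shape `(strings, frames)`; the function name, shapes and threshold are my assumptions, not actual project code:

```python
import torch

def merged_string_rpa(pred_pitch, true_pitch, true_voiced, threshold=50.0):
    """RPA with string identity ignored: a voiced target frame counts as a
    hit if ANY string's prediction lands within `threshold` cents of it."""
    # Hz -> cents (the reference frequency cancels out in the difference)
    pred_cents = 1200 * torch.log2(pred_pitch.clamp(min=1e-7))
    true_cents = 1200 * torch.log2(true_pitch.clamp(min=1e-7))

    # pairwise differences per frame: (strings_true, strings_pred, frames)
    diff = (true_cents[:, None, :] - pred_cents[None, :, :]).abs()
    best = diff.min(dim=1).values  # best match over all predicted strings

    hits = (best < threshold) & true_voiced
    return hits.sum() / true_voiced.sum().clamp(min=1)
```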
The new metric could be called Full RPA, Full RCA, Full RMSE; I'm not sure which name is better. There are thresholds set for periodicity, and those thresholds are not that accurate either (60-80%), which could be because the network is simply better at recognizing the pitch than the string. To make these metrics happen, periodicity first has to be evaluated with all candidate thresholds; this is the way to find out which threshold really behaves best when periodicity estimation is combined with pitch estimation:
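Roughly, the sweep could look like this; all names below are placeholders, and the RPA is computed only on frames that both the ground truth and the thresholded periodicity mark as voiced:

```python
import torch

THRESHOLDS = torch.linspace(0.05, 0.95, 19)

def sweep_periodicity_thresholds(pitch_pred, pitch_true, periodicity, voiced_true):
    """Return {threshold: RPA} for every candidate periodicity threshold."""
    results = {}
    for t in THRESHOLDS:
        voiced_pred = periodicity > t   # voicing decision at this threshold
        mask = voiced_true & voiced_pred
        if not mask.any():
            results[float(t)] = 0.0
            continue
        # RPA: fraction of frames within 50 cents of the target
        cents_err = 1200 * torch.log2(pitch_pred[mask] / pitch_true[mask]).abs()
        results[float(t)] = (cents_err < 50).float().mean().item()
    return results
```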
This is now becoming priority NUMBER UNO; the hope is that this new metric will allow for a better translation of the network's performance into accuracy. IMPORTANT!
`MultiPitchMetrics()` has to be called from the `Metrics` class and not really from anywhere else. Keep that in mind!
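A sketch of the intended wiring, assuming `Metrics` follows the usual `update()`/`__call__()` pattern; the attribute and argument names are my invention:

```python
class Metrics:

    def __init__(self, thresholds):
        # MultiPitchMetrics is owned by Metrics and called from nowhere else
        self.multi_pitch = MultiPitchMetrics(thresholds)

    def update(self, pitch_pred, pitch_true, periodicity, voiced_true):
        self.multi_pitch.update(pitch_pred, pitch_true, periodicity, voiced_true)

    def __call__(self):
        # merge the sub-metric results into the overall results dict
        return {**self.multi_pitch()}
```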
Plots are not ready for the pitch periodicity information yet. I'm a bit worried that something is wrong with the threshold-based periodicity estimation, but let's see how the training and evaluation go.
Make the new metric plot the outputs for multiple thresholds on one plot with wandb. Here's a bit on how to do so:
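`wandb.plot.line_series` can draw one curve per threshold on a single chart; the project name and the metric values below are placeholders for illustration only:

```python
import wandb

wandb.init(project="poly-pitch-net")  # hypothetical project name

thresholds = [0.1, 0.3, 0.5, 0.7]
steps = [0, 1, 2, 3, 4]
# one RMSE curve per periodicity threshold (dummy numbers)
rmse = {
    0.1: [120, 110, 95, 90, 88],
    0.3: [130, 105, 92, 85, 80],
    0.5: [140, 120, 100, 96, 94],
    0.7: [160, 150, 140, 138, 137],
}

wandb.log({
    "FRMSE2 per threshold": wandb.plot.line_series(
        xs=steps,
        ys=[rmse[t] for t in thresholds],
        keys=[f"threshold {t}" for t in thresholds],
        title="Full RMSE at different periodicity thresholds",
        xname="step",
    )
})
```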
Periodicity does work well; it's now clear that the network struggles tremendously with recognizing the source string. The next step is to fix the metric, because right now it seems to output rubbish.
There's a need for a metric that can tell if the number of present notes was predicted correctly. The metric could either say how well the number of notes was predicted or how many times it was exactly right.
RCA, RPA and RMSE require pitch in cents, but when the passed pitch can contain zero values that's not much of an option. Either compare only the values that contain pitch, or set the 0s to penn.FMIN before converting to cents. The new metric could, for example, subtract the predicted values row by row from the target values and take the minimum difference as the metric; that would tell how good the prediction is regardless of the order of the target or predicted rows. Apparently both sets of data (predicted and true) are a bit flawed, and not all notes always make it into the predictions. A sketch of both ideas follows.
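A minimal sketch, assuming `(strings, frames)` tensors; `FMIN` and every name here are placeholders rather than the real penn API:

```python
import torch

FMIN = 32.70  # stand-in for penn.FMIN; check the actual constant

def cents(freq):
    """Hz -> cents above FMIN; zeros get clamped to FMIN first."""
    return 1200 * torch.log2(freq.clamp(min=FMIN) / FMIN)

def note_count_accuracy(pred_voiced, true_voiced):
    """Fraction of frames where the number of active notes is exactly right;
    inputs are (strings, frames) boolean voicing masks."""
    return (pred_voiced.sum(dim=0) == true_voiced.sum(dim=0)).float().mean()

def min_difference_error(pred_pitch, true_pitch):
    """Order-agnostic error: for every target row take the minimum cents
    difference over all predicted rows, then average over frames."""
    diff = (cents(true_pitch)[:, None, :] - cents(pred_pitch)[None, :, :]).abs()
    return diff.min(dim=1).values.mean()
```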
Use this as an example. The example above covers exactly the range of 4.8-5.3 seconds.
The simple metric shows that in over 90% of cases the network does recognize the correct pitch on at least one of the strings whenever the ground truth shows voiced data in a given string.
(charts: train set / validation set)
Having read the charts, the conclusion is that the entropy-based function estimating pitch periodicity does not really work well on a per-string basis. The pitch part of the network is pretty good as long as the string confusion is ignored; periodicity is not reliable at all. The highest accuracy values show up when the periodicity values are completely ignored.
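For reference, a sketch of the kind of entropy-based periodicity in question: a peaked pitch posterior (low entropy) maps to high periodicity, a flat one to low. The shapes are assumed, not taken from the actual code:

```python
import torch

def entropy_periodicity(logits):
    """logits: (strings, pitch_bins, frames) -> periodicity in [0, 1]."""
    probs = torch.softmax(logits, dim=1)
    entropy = -(probs * torch.log2(probs.clamp(min=1e-9))).sum(dim=1)
    max_entropy = torch.log2(torch.tensor(float(logits.shape[1])))
    return 1 - entropy / max_entropy
```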
With the new metric the generalization gap between the training and validation sets is much lower. It used to be around 15-20%; now it is below 5% for the model evaluated at the checkpoint with the highest accuracy: training FRCA2 around 95% and validation FRCA2 around 92%. Great results, one could say!
Test FRMSE2 seems very promising too: on the test example from above it came out at around 2, which is really good!
...and indeed, after a few iterations on a trained network the results are exceptional: below 100 RMSE, which never really happened consistently before. Sadly, again, the lowest RMSE values are recorded when the periodicity threshold is almost completely ignored:
Both `ppn-split-voiced` and `ppn-split-rm-d4-t10` got down to below 70 RMSE.
Closing the issue; metrics FRMSE2 and FRCA2 are now ready.