The net:cal calibration framework is a Python 3 library for measuring and mitigating miscalibration of uncertainty estimates, e.g., by a neural network.
Dear Fabian,

Thank you for the time you put into this repo and for open-sourcing your code!
I have never used netcal before, and so I found myself comparing it to other libraries and pieces of code that do similar things. Concerning the visualisation function(s), specifically `netcal.presentation.ReliabilityDiagram`, I was wondering: is the quantity you plot on the y-axis really the accuracy, or is it the relative frequency of positive examples in each bin (as, to my understanding, it should be in calibration curves)?
Checking the code here, in particular this snippet:
```python
for batch_X, batch_matched, batch_hist, batch_median in zip(X, matched, histograms, median_confidence):
    acc_hist, conf_hist, _, num_samples_hist = batch_hist
    empty_bins, = np.nonzero(num_samples_hist == 0)

    # calculate overall mean accuracy and confidence
    mean_acc.append(np.mean(batch_matched))
    mean_conf.append(np.mean(batch_X))
```
Assuming `batch_matched` stores the ground-truth labels for each batch, I am pretty confident that quantity should not be named "accuracy" (still, I confess I have not spent a lot of time trying to understand perfectly what the various functions should return).
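To make the concern concrete (with made-up labels, not anything taken from netcal's internals): if `batch_matched` really does hold binary ground-truth labels, then the `np.mean(batch_matched)` above computes the fraction of positive examples in the batch, not an accuracy:

```python
import numpy as np

# hypothetical ground-truth labels for one batch (not from netcal)
batch_matched = np.array([0, 1, 1, 0, 1])

# the snippet's "mean accuracy" then equals the fraction of positives
print(np.mean(batch_matched))  # 0.6, i.e. 3 positives out of 5 samples
```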
I have also tried to compare the results from netcal with scikit-learn's `calibration_curve` function, whose documentation states that it returns "the proportion of samples whose class is the positive class, in each bin (fraction of positives)", and the results look very similar, if not identical, to what I get with netcal.
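For reference, this is roughly the kind of comparison I ran (a minimal sketch on synthetic data; the variable names and the choice of 10 bins are mine):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from netcal.presentation import ReliabilityDiagram

# synthetic binary problem: confidences in [0, 1], labels drawn so that
# higher confidence means a higher chance of the positive class
rng = np.random.default_rng(0)
confidences = rng.uniform(size=1000)
labels = (rng.uniform(size=1000) < confidences).astype(int)

# scikit-learn: fraction of positives and mean confidence per bin
frac_pos, mean_conf = calibration_curve(labels, confidences, n_bins=10)
print(frac_pos)

# netcal: reliability diagram over the same inputs
diagram = ReliabilityDiagram(10)
diagram.plot(confidences, labels)
plt.show()
```

In my runs the per-bin values from `calibration_curve` line up with the bar heights in netcal's diagram, which is what prompted the question.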
It would be amazing if you could clarify this!
Cheers, Dennis.