Closed norabelrose closed 1 year ago
Looks great to me, as long as you are fine with the fact that the cal_acc computation still uses both the positive and negative questions even in the "None" ensembling method.
Ummm yeah I think that is expected behavior because the accuracy is supposed to be like "top 1 accuracy" for the question, not accuracy for classifying "Q? A" statements as true or false. So inherently you do need to include the credences for both pseudolabels.
This PR introduces three "modes" for ensembling the different credences output by a reporter for a single cluster or datapoint.
- `none` is no ensembling, or what we do right now.
- `partial` means we ensemble across contrast pairs, similar to what was done in the DLK paper. Specifically, the ensembled credence is `pos - neg`, where `pos` is the credence on the positive pseudolabel and `neg` is the credence on the negative pseudolabel.
- `full` ensembles both across contrast pairs and across different prompt templates. Specifically, we take the average of the credences computed across all prompt templates and use that as the prediction for the cluster.

These are implemented as options on the `evaluate_preds` function. The `elicit` and `eval` commands log metrics for all 3 values.

It turns out this ensembling can matter quite a bit; `full` is often on the order of 10 percentage points higher AUROC / acc than `none`.
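
For readers skimming the thread, here is a minimal sketch of how the three modes described above could combine the credences for a single cluster. It is illustrative only: `ensemble_credences`, its argument layout (one `(neg, pos)` pair per prompt template), and the choice to return the raw pairs for `none` are assumptions made for this example, not the signature or behavior of `evaluate_preds`. Averaging the per-template `pos - neg` differences for `full` is one reading of the description, not a confirmed detail.

```python
from statistics import mean


def ensemble_credences(pairs, mode="none"):
    """Combine reporter credences for one cluster (hypothetical helper).

    `pairs` holds one (neg, pos) credence pair per prompt template.
    """
    if mode == "none":
        # No ensembling: every credence is kept as-is.
        return list(pairs)

    # Ensemble across each contrast pair: pos - neg, one score per template.
    diffs = [pos - neg for neg, pos in pairs]
    if mode == "partial":
        return diffs
    if mode == "full":
        # Additionally average across prompt templates: one score per cluster.
        return [mean(diffs)]
    raise ValueError(f"unknown ensembling mode: {mode!r}")


# Three prompt templates for the same question (made-up numbers).
example = [(0.25, 0.75), (0.5, 0.5), (0.0, 1.0)]
partial_scores = ensemble_credences(example, "partial")
full_scores = ensemble_credences(example, "full")
```

With `partial` there is still one score per prompt template, while `full` collapses them into a single score for the cluster, which is why the two can yield different AUROC / acc numbers on the same reporter outputs.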