EleutherAI / elk

Keeping language models honest by directly eliciting knowledge encoded in their activations.
MIT License

Log ensembled metrics #215

Closed norabelrose closed 1 year ago

norabelrose commented 1 year ago

This PR introduces three "modes" for ensembling the different credences output by a reporter for a single cluster or datapoint.

  1. none applies no ensembling, which is the current behavior.
  2. partial means we ensemble across contrast pairs, similar to what was done in the DLK paper. Specifically, the ensembled credence is pos - neg where pos is the credence on the positive pseudolabel and neg is the credence on the negative pseudolabel.
  3. full ensembles both across contrast pairs and across different prompt templates. Specifically, we take the average of the credences computed across all prompt templates and use that as the prediction for the cluster.

These are implemented as options on the evaluate_preds function. The elicit and eval commands log metrics for all three modes.
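As a rough illustration of the three modes, here is a minimal pure-Python sketch. It is not the code from this PR: the helper name ensemble_credences, the nested-list layout, and the (neg, pos) ordering within each contrast pair are all assumptions made for the example, and "full" is read here as the mean over templates of the per-template pos - neg values.

```python
from statistics import mean


def ensemble_credences(credences, mode="none"):
    """Hypothetical helper, NOT elk's evaluate_preds.

    credences[i][j] is an assumed (neg, pos) pair: the reporter's credence on
    the negative and positive pseudolabel for example i under template j.
    """
    if mode == "none":
        # No ensembling: keep every (template, pseudolabel) credence as-is.
        return [[list(pair) for pair in templates] for templates in credences]

    # Ensemble across each contrast pair: pos - neg, one value per template.
    diffs = [[pos - neg for neg, pos in templates] for templates in credences]
    if mode == "partial":
        return diffs
    if mode == "full":
        # Additionally average across prompt templates: one value per example.
        return [mean(per_template) for per_template in diffs]
    raise ValueError(f"unknown ensembling mode: {mode!r}")


# Two examples, two prompt templates each.
example = [
    [(0.25, 1.0), (0.5, 0.0)],
    [(-1.0, -2.0), (0.0, -1.0)],
]
print(ensemble_credences(example, "partial"))  # [[0.75, -0.5], [-1.0, -1.0]]
print(ensemble_credences(example, "full"))     # [0.125, -1.0]
```

In the first example the two templates disagree in sign under partial, while full collapses them into a single positive prediction for the cluster.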

It turns out this ensembling can matter quite a bit; full often scores on the order of 10 percentage points higher than none in both AUROC and accuracy.

norabelrose commented 1 year ago

> Looks great to me, as long as you are fine with the fact that the cal_acc computation still uses both the positive and negative questions even in the "None" ensembling method.

Ummm yeah I think that is expected behavior because the accuracy is supposed to be like "top 1 accuracy" for the question, not accuracy for classifying "Q? A" statements as true or false. So inherently you do need to include the credences for both pseudolabels.
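To make the "top 1 accuracy" point concrete, here is a small sketch of why the credences for both pseudolabels are needed: the prediction for a question is the answer with the highest credence, so one credence alone cannot be scored. The function name top1_accuracy and the data layout are hypothetical, and this does not model the calibration step behind cal_acc.

```python
def top1_accuracy(credences, labels):
    """Hypothetical illustration, not elk's cal_acc.

    credences[i] holds one credence per candidate answer to question i
    (here two: one per pseudolabel); labels[i] is the index of the true answer.
    """
    correct = 0
    for per_answer, label in zip(credences, labels):
        # The prediction is the answer with the highest credence
        # (ties resolve to the lowest index).
        pred = max(range(len(per_answer)), key=per_answer.__getitem__)
        correct += int(pred == label)
    return correct / len(labels)


credences = [(0.2, 0.9), (0.7, 0.6), (0.1, 0.4)]
labels = [1, 0, 0]
print(top1_accuracy(credences, labels))  # 2 of the 3 questions are answered correctly
```

Scoring each "Q? A" statement as true or false on its own would be a different metric, one that thresholds each credence independently instead of comparing the two answers to the same question.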