Log ensembled metrics

This PR introduces three "modes" for ensembling the different credences output by a reporter for a single cluster or datapoint.

none: no ensembling, which is the current behavior.
partial: ensembles across contrast pairs, similar to what was done in the DLK paper. Specifically, the ensembled credence is pos - neg, where pos is the credence on the positive pseudolabel and neg is the credence on the negative pseudolabel.
full: ensembles across both contrast pairs and prompt templates. Specifically, the credences computed for each prompt template are averaged, and that average is used as the prediction for the cluster.
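As a rough illustration of the three modes, here is a minimal NumPy sketch. It is not the code from this PR: the function name `ensemble_credences`, the `(n_clusters, n_templates, 2)` array layout, and the `(neg, pos)` ordering of the last axis are all assumptions made for the example, and it assumes that full averages the pos - neg credences over templates.

```python
import numpy as np


def ensemble_credences(credences: np.ndarray, mode: str) -> np.ndarray:
    """Hypothetical helper illustrating the three ensembling modes.

    Assumed layout: `credences` has shape (n_clusters, n_templates, 2),
    where the last axis holds (neg, pos) credences for one contrast pair.
    """
    credences = np.asarray(credences, dtype=float)
    if credences.ndim != 3 or credences.shape[-1] != 2:
        raise ValueError("expected shape (n_clusters, n_templates, 2)")

    if mode == "none":
        # No ensembling: every credence is kept as-is.
        return credences

    # Ensemble across the contrast pair: pos - neg, shape (n_clusters, n_templates).
    pair_ensembled = credences[..., 1] - credences[..., 0]
    if mode == "partial":
        return pair_ensembled

    if mode == "full":
        # Additionally average over prompt templates: shape (n_clusters,).
        return pair_ensembled.mean(axis=1)

    raise ValueError(f"unknown ensembling mode: {mode!r}")


# One cluster, two prompt templates, (neg, pos) credences per template.
example = np.array([[[0.2, 0.9], [0.4, 0.6]]])
for mode in ("none", "partial", "full"):
    print(mode, ensemble_credences(example, mode).shape)
```

Each mode removes one axis of variation, so the number of predictions scored per cluster shrinks from one per template and pseudolabel, to one per template, to one overall.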

These are implemented as options on the evaluate_preds function. The elicit and eval commands log metrics for all three modes.

It turns out this ensembling can matter quite a bit: full is often on the order of 10 percentage points higher in AUROC and accuracy than none.

EleutherAI / elk