Tutorial for using data valuation to select examples for in-context learning

We should create a tutorial showing how to use data valuation to select examples for in-context learning, similar to what was done in the paper "Data Curation Alone Can Stabilize In-context Learning". The paper defines the value of an example as the average validation-set accuracy over all subsets (prompts) in which that example appears:

$$ s_{\text{ca}}(i) = \mathbb{E}_{\mathcal{Z} \sim D_{\text{ICL}}} \left[ \text{Acc}(\mathcal{Z}) \mid (x_i, y_i) \in \mathcal{Z} \right] $$

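where $i$ is the example's index, $(x_i, y_i)$ is the example's input and output, $\mathcal{Z}$ is the prompt (a subset of examples) drawn from $D_{\text{ICL}}$, and $\text{Acc}(\mathcal{Z})$ is the validation accuracy obtained with that prompt.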
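As a rough illustration of the definition above, the score can be estimated by sampling prompts and averaging the accuracy of those that contain each example. This is only a sketch: `toy_accuracy` is a hypothetical stand-in for running the language model on the validation set, prompts are sampled uniformly at a fixed size (an assumption about $D_{\text{ICL}}$), and nothing here is taken from the paper's code or from pyDVL.

```python
import random
from collections import defaultdict


def conditional_accuracy(n_examples, prompt_size, n_prompts, accuracy_fn, seed=0):
    """Monte Carlo estimate of E[Acc(Z) | example i in Z] for every example i."""
    rng = random.Random(seed)
    totals = defaultdict(float)  # sum of accuracies of prompts containing i
    counts = defaultdict(int)    # number of sampled prompts containing i
    for _ in range(n_prompts):
        # Assumption: prompts are uniform random subsets of a fixed size.
        prompt = tuple(rng.sample(range(n_examples), prompt_size))
        acc = accuracy_fn(prompt)
        for i in prompt:
            totals[i] += acc
            counts[i] += 1
    return {i: totals[i] / counts[i] for i in counts}


def toy_accuracy(prompt):
    """Hypothetical utility: examples 0 and 1 each add 0.2 to a 0.5 baseline."""
    helpful = {0, 1}
    return 0.5 + 0.2 * len(helpful & set(prompt))


scores = conditional_accuracy(
    n_examples=6, prompt_size=3, n_prompts=2000, accuracy_fn=toy_accuracy
)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # the two helpful examples should come first
```

In a real tutorial `accuracy_fn` would build the prompt from the selected examples, query the model, and score its predictions on the validation set, which is by far the most expensive step.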

The authors show (see Appendix A.1 of the paper) that the Data Shapley value is proportional to this value.

With pyDVL we can take a different route and simply compute Shapley or Banzhaf values for the examples.
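To make the idea concrete without committing to any particular pyDVL class, here is a plain-Python sketch of permutation-sampling Shapley values. The function `permutation_shapley` and the additive toy utility are hypothetical and written for illustration; they are not pyDVL's API, and the sketch ignores the context-length limit discussed in this issue.

```python
import random


def permutation_shapley(n_examples, utility, n_permutations, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled permutations of the examples."""
    rng = random.Random(seed)
    values = [0.0] * n_examples
    indices = list(range(n_examples))
    for _ in range(n_permutations):
        rng.shuffle(indices)
        prefix = []
        previous = utility(frozenset())
        for i in indices:
            prefix.append(i)
            current = utility(frozenset(prefix))
            values[i] += current - previous  # marginal contribution of i
            previous = current
    return [v / n_permutations for v in values]


# Hypothetical additive utility: for such a utility the Shapley value of
# each example equals its weight, which makes the estimator easy to check.
weights = [0.3, 0.1, 0.0, 0.2]


def additive_utility(subset):
    return sum(weights[i] for i in subset)


shapley = permutation_shapley(len(weights), additive_utility, n_permutations=50)
print(shapley)
```

For in-context learning the utility would be the validation accuracy of the prompt built from `subset`, so every utility call is a batch of model queries and the number of permutations has to stay small.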

Here are some of the considerations we have to take into account:

- We cannot put all available examples in the prompt because of context-length limits. We should probably create a new sampler class, or a post-processor, that filters the generated samples and removes subsets whose size exceeds the limit.
- We should consider the order in which examples appear in the prompt. This could make the computation scale much worse, since a subset of $k$ examples corresponds to $k!$ ordered prompts.
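Both considerations can be sketched in a few lines. The helpers below are hypothetical (they are not existing pyDVL samplers): `filter_by_size` is a post-processor that drops subsets above the limit, `random_subsets` is a simple stand-in for a subset sampler, and the two counting functions show how much larger the search space becomes once order matters.

```python
import math
import random


def random_subsets(n_examples, n_samples, inclusion_prob=0.5, seed=0):
    """Stand-in sampler: include each index independently with a fixed probability."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        yield frozenset(i for i in range(n_examples) if rng.random() < inclusion_prob)


def filter_by_size(samples, max_size):
    """Post-processor: keep only subsets that fit within the context limit."""
    for subset in samples:
        if len(subset) <= max_size:
            yield subset


def count_unordered(n_examples, max_size):
    """Number of subsets with at most max_size elements."""
    return sum(math.comb(n_examples, k) for k in range(max_size + 1))


def count_ordered(n_examples, max_size):
    """Number of ordered prompts with at most max_size elements."""
    return sum(math.perm(n_examples, k) for k in range(max_size + 1))


kept = list(filter_by_size(random_subsets(10, 500), max_size=4))
print(f"kept {len(kept)} of 500 sampled subsets")
print(count_unordered(20, 4), count_ordered(20, 4))
```

Rejection-style filtering wastes most samples when the limit is small relative to the number of examples, so a sampler that draws subset sizes directly within the limit is likely the better design for the tutorial.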

aai-institute / pyDVL #608