aai-institute / pyDVL

pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
https://pydvl.org
GNU Lesser General Public License v3.0

Tutorial for using data valuation to select examples for in-context learning #608

Open AnesBenmerzoug opened 4 months ago

AnesBenmerzoug commented 4 months ago

We should create a tutorial showing how to use data valuation to select examples for in-context learning, similar to what was done in the paper "Data Curation Alone Can Stabilize In-context Learning". The paper defines the value of an example as the average validation-set accuracy over all prompts (subsets of training examples) in which that example appears:

$$ s_{\text{ca}}(i) = \mathbb{E}_{\mathcal{Z} \sim D_{\text{ICL}}} \left[ \text{Acc}(\mathcal{Z}) \mid (x_i, y_i) \in \mathcal{Z} \right] $$

Here, $i$ is the example's index, $(x_i, y_i)$ is the example's input and output, $\mathcal{Z}$ is the prompt, and $\text{Acc}(\mathcal{Z})$ is the validation accuracy obtained with that prompt.

The authors show (see Appendix A.1 of the paper) that the Data Shapley value is proportional to this value.
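To make the conditional-accuracy score concrete, here is a minimal Monte Carlo sketch of it in plain Python. It is not the paper's code: `conditional_accuracy_scores` and `toy_accuracy` are hypothetical names, prompts are assumed to be sampled uniformly with a fixed size, and `toy_accuracy` is a stand-in for actually querying a language model and scoring it on the validation set.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Sequence


def conditional_accuracy_scores(
    n_examples: int,
    prompt_size: int,
    n_prompts: int,
    accuracy: Callable[[Sequence[int]], float],
    seed: int = 0,
) -> Dict[int, float]:
    """Estimate s_ca(i) as the mean accuracy over sampled prompts containing i."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(n_prompts):
        # Sample a prompt Z: an ordered selection of distinct example indices.
        # Uniform sampling with a fixed size is an assumption of this sketch.
        prompt = rng.sample(range(n_examples), prompt_size)
        acc = accuracy(prompt)
        for i in prompt:
            totals[i] += acc
            counts[i] += 1
    # Examples that never appeared in a sampled prompt get no score.
    return {i: totals[i] / counts[i] for i in counts}


def toy_accuracy(prompt: Sequence[int]) -> float:
    """Hypothetical stand-in for the validation accuracy of an LLM given a prompt."""
    # Pretend that even-indexed examples are helpful and odd-indexed ones are not.
    return sum(1.0 for i in prompt if i % 2 == 0) / len(prompt)


scores = conditional_accuracy_scores(
    n_examples=10, prompt_size=4, n_prompts=2000, accuracy=toy_accuracy
)
# Indices sorted from most to least valuable according to the estimate.
ranking = sorted(scores, key=scores.get, reverse=True)
```

In a real tutorial, `accuracy` would build the prompt text from the selected examples, run the model on the validation set, and return the fraction of correct answers. That call dominates the cost, so the number of sampled prompts is the main budget knob.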

With pyDVL we can take a different route and compute Shapley or Banzhaf values directly.
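As a rough illustration of what those two semivalues estimate, the sketch below implements permutation-sampling Shapley and subset-sampling Banzhaf in plain Python. It deliberately does not use pyDVL's API (the tutorial itself should rely on the library's own implementations); `permutation_shapley`, `monte_carlo_banzhaf`, `toy_utility` and `WEIGHTS` are hypothetical names, and the additive toy utility stands in for "validation accuracy of the model when prompted with this subset of examples".

```python
import random
from typing import Callable, FrozenSet, List

Utility = Callable[[FrozenSet[int]], float]


def permutation_shapley(u: Utility, n: int, n_permutations: int, seed: int = 0) -> List[float]:
    """Monte Carlo Shapley: average marginal contributions over random permutations."""
    rng = random.Random(seed)
    values = [0.0] * n
    indices = list(range(n))
    for _ in range(n_permutations):
        rng.shuffle(indices)
        subset: FrozenSet[int] = frozenset()
        previous = u(subset)
        for i in indices:
            # Add example i to the growing subset and record its marginal gain.
            subset = subset | {i}
            current = u(subset)
            values[i] += current - previous
            previous = current
    return [v / n_permutations for v in values]


def monte_carlo_banzhaf(u: Utility, n: int, n_samples: int, seed: int = 0) -> List[float]:
    """Monte Carlo Banzhaf: average marginal contributions over uniformly random subsets."""
    rng = random.Random(seed)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for _ in range(n_samples):
            # Each other example is included independently with probability 1/2.
            subset = frozenset(j for j in others if rng.random() < 0.5)
            values[i] += u(subset | {i}) - u(subset)
    return [v / n_samples for v in values]


# Hypothetical per-example effects; a real utility would query the model instead.
WEIGHTS = [0.5, -0.25, 0.125, 0.0]


def toy_utility(subset: FrozenSet[int]) -> float:
    """Additive toy utility, for which both semivalues equal the per-example weight."""
    return sum(WEIGHTS[i] for i in subset)


shapley = permutation_shapley(toy_utility, n=len(WEIGHTS), n_permutations=50)
banzhaf = monte_carlo_banzhaf(toy_utility, n=len(WEIGHTS), n_samples=50)
```

For an additive utility every marginal contribution of example `i` equals its weight, so both estimators recover `WEIGHTS`. With a real in-context learning utility the two methods generally differ, and each utility evaluation costs one pass of the model over the validation set.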

Here are some of the considerations we have to take into account: