TRAIS-Lab / dattri

`dattri` is a PyTorch library for developing, benchmarking, and deploying efficient data attribution algorithms.
https://trais-lab.github.io/dattri/

[dattri.model_utils] Add a subset sampler that uses TF-IDF to compute text similarity #73

Open SeanZh30 opened 3 months ago

SeanZh30 commented 3 months ago

Description

1. Motivation and Context

Section 3.2.1 of the paper "Studying Large Language Model Generalization with Influence Functions" proposes using TF-IDF to compute the similarity between test data and training data. Influence values are then computed only for the subset of training data with high similarity to the test data, which reduces the computational cost. This could be a feasible method for our nanoGPT benchmark.

2. Summary of the change

Add a `tfidf_subset_sampler` to `dattri.model_utils`.
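
To make the idea concrete, here is a minimal sketch of the selection step, assuming whitespace tokenization, a smoothed IDF, and cosine similarity. The function name `tfidf_top_k` and its signature are hypothetical; they are not part of dattri and not necessarily the exact procedure used in the paper.

```python
import math
from collections import Counter


def tfidf_top_k(train_texts, query_text, k):
    """Return indices of the k training texts most similar to query_text.

    Hypothetical helper for illustration only; not an existing dattri API.
    """
    # Whitespace tokenization is a simplification made for this sketch.
    docs = [text.lower().split() for text in train_texts]
    query = query_text.lower().split()

    # Document frequency: number of training texts containing each token.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    n_docs = len(docs)

    def vectorize(tokens):
        counts = Counter(tokens)
        # Smoothed IDF, so tokens unseen in training still get a finite weight.
        return {
            tok: cnt * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, cnt in counts.items()
        }

    def cosine(a, b):
        dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = vectorize(query)
    scores = [cosine(vectorize(doc), query_vec) for doc in docs]
    # Sort by descending similarity; ties keep the original order.
    ranked = sorted(range(n_docs), key=lambda i: -scores[i])
    return ranked[:k]


train_texts = [
    "the cat sat on the mat",
    "stock prices fell sharply today",
    "a cat chased the mouse",
]
top_indices = tfidf_top_k(train_texts, "cat on a mat", k=2)
```

The returned indices could then be used to restrict the training set before computing influence values, for example by passing them to `torch.utils.data.Subset`. A real implementation would likely reuse the model's tokenizer rather than splitting on whitespace.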

3. What tests have been added/updated for the change?

N.A.