TRAIS-Lab / dattri

`dattri` is a PyTorch library for developing, benchmarking, and deploying efficient data attribution algorithms.
https://trais-lab.github.io/dattri/

[dattri.model_utils] Add tfidf sampler #75

Closed SeanZh30 closed 2 months ago

SeanZh30 commented 3 months ago

Description

This module provides a function, tfidf_subset_sampler, that samples a subset of the training data based on TF-IDF similarity, saves the filtered training data, and returns the corresponding indices into the original training set. It is designed for benchmarking nanoGPT.
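For readers unfamiliar with the idea, here is a minimal, standard-library-only sketch of TF-IDF-based subset selection. It is not the code from this PR: the names `tfidf_vectors`, `cosine`, and `sample_by_tfidf` are hypothetical, and whitespace tokenization, smoothed IDF, and top-k selection against a single query document are all assumptions made for illustration. Saving the filtered data to disk is left out.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    # Assumption: simple lowercase whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Smoothed IDF, log((1 + N) / (1 + df)) + 1, keeps weights positive.
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sample_by_tfidf(train_docs, query_doc, k):
    """Return sorted indices into train_docs of the k documents most similar to query_doc."""
    # IDF statistics are computed over the training documents plus the query.
    vectors = tfidf_vectors(list(train_docs) + [query_doc])
    query_vec = vectors[-1]
    scores = [cosine(vec, query_vec) for vec in vectors[:-1]]
    # Rank by descending similarity; break ties by original index.
    ranked = sorted(range(len(train_docs)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])


train_docs = [
    "the cat sat on the mat",
    "stock prices fell sharply today",
    "a cat and a dog played",
    "quarterly earnings beat forecasts",
]
indices = sample_by_tfidf(train_docs, "my cat chased the dog", k=2)
filtered = [train_docs[i] for i in indices]
print(indices)  # [0, 2]
```

Returning indices into the original training set, rather than only the filtered documents, lets attribution scores computed on the subset be mapped back to the full dataset.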

1. Motivation and Context

Addresses the problem described in issue #73.

2. Summary of the change

Adds a new file, tfidf_sampler.py, under model_utils. It contains the callable function tfidf_subset_sampler.

3. What tests have been added/updated for the change?

tingwl0122 commented 3 months ago

I guess we can add a simple test file to test the functionality?

jiaqima commented 2 months ago

I'm not sure which of my actions closed this PR, and I cannot reopen it. Please start a new PR for this.