Closed · charles-pyj closed this 1 month ago
I gave a review of the changes and they should be correct. The only concern is that this breaks the original purpose of creating a BaseInnerProductAttributor: the intent was that users would only need to implement a few methods, such as transformation_on_query. But this implementation may reveal that some problems lie in the current implementation of attribute. (I guess we can try to fix the base class later.) Please fix the documentation and the conflicts so that we can merge this PR.
@TheaperDeng Thanks! Sorry for the late update. After discussion, we have found another way that solves the memory issue without requiring a separate DataInf implementation (as this PR does). I will keep that for another PR (or reopen this PR later when I finish), so for now let's just use this PR as a stand-alone implementation for testing DataInf? Also, the conflicts and ruff checks are resolved.
I didn't go through the full code. It seems that the current implementation will have some serious compatibility issues with the new API design outlined in #130, so we'd better wait for that to be finalized.
Meanwhile, there are quite a few typos in the docstrings. Please pay attention to them in the next iteration.
Thanks. I will refine the code when the new API design is settled.
Resolved conflicts with upstream and updated the docstrings.
One last thing: given that the substantial implementation for this attributor has moved into the attributor class (instead of an ihvp function), could you add a few tests in test/algorithm/test_influence_function.py to test the implementation of this attributor? Ideally, we want to test both the attributor as a whole and the transform_test_rep function.
Just to clarify: test_influence_function.py does have a simple test of DataInf that only checks that it runs successfully. We would like more extended testing of the results as well, right?
Yes. I think it would be better to test the intermediate function, transform_test_rep, as well. For example, you could craft a random initial test_rep and keep the set of cached train gradients very small (by adjusting fim_estimate_data_ratio). Then you could pull out the cached train gradients, calculate the transformed test_rep by plugging them directly into the formula, and compare the result with the output of transform_test_rep.
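A minimal sketch of that kind of check, assuming the DataInf transformation follows the usual per-layer formula H⁻¹q ≈ (1/(nλ)) Σᵢ [q − (gᵢᵀq / (λ + ‖gᵢ‖²)) gᵢ]. How the cached gradients are pulled out of IFAttributorDataInf, and the exact transform_test_rep signature, are not shown here and would follow the actual class:

```python
import torch


def datainf_transform_reference(test_rep, train_grads, damping):
    # Vectorized per-layer DataInf transformation:
    #   H^{-1} q ≈ (1 / (n * λ)) * Σ_i [ q - (g_iᵀ q / (λ + ||g_i||²)) * g_i ]
    # test_rep: (num_test, p), train_grads: (num_train, p)
    coef = (test_rep @ train_grads.T) / (damping + (train_grads**2).sum(dim=1))
    return (test_rep - coef @ train_grads / train_grads.shape[0]) / damping


def test_transform_test_rep_matches_formula():
    torch.manual_seed(0)
    num_train, num_test, p = 8, 3, 5  # tiny on purpose: cheap to brute-force
    damping = 0.1
    train_grads = torch.randn(num_train, p)  # stand-in for the cached gradients
    test_rep = torch.randn(num_test, p)

    # Brute-force loop straight from the formula, one training sample at a time.
    expected = torch.zeros_like(test_rep)
    for g in train_grads:
        c = (test_rep @ g) / (damping + g @ g)  # (num_test,)
        expected += test_rep - c.unsqueeze(1) * g
    expected /= num_train * damping

    assert torch.allclose(
        datainf_transform_reference(test_rep, train_grads, damping),
        expected,
        rtol=1e-5,
        atol=1e-7,
    )
```

In the real test, `expected` would instead be compared against the output of transform_test_rep computed on the same cached gradients (kept small via fim_estimate_data_ratio).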
Got it! A test has been added for the test-representation transformation.
Description
Changes to the implementation of DataInf
1. Motivation and Context
The previous implementation was based on ifvp functions, which, combined with the use of vmap, caused heavy memory overhead: a tensor of shape (num_train, num_params, num_val) was held in memory in total.
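A minimal illustration of that overhead, with hypothetical shapes and a simplified per-sample ifvp (the real dattri function differs): vmapping the per-sample computation over all training gradients materializes the full (num_train, num_params, num_val) result at once instead of streaming it one training sample at a time.

```python
import torch
from torch.func import vmap

num_train, num_params, num_val = 100, 1000, 10
train_grads = torch.randn(num_train, num_params)
val_grads = torch.randn(num_val, num_params)
damping = 0.1


def per_sample_ifvp(g):
    # Simplified DataInf-style per-sample iHVP of every validation gradient:
    # returns a (num_params, num_val) block for one training gradient g.
    coef = (val_grads @ g) / (damping + g @ g)  # (num_val,)
    return (val_grads.T - g.unsqueeze(1) * coef.unsqueeze(0)) / damping


# vmap evaluates all training samples at once, so the whole
# (num_train, num_params, num_val) tensor lives in memory simultaneously.
ifvps = vmap(per_sample_ifvp)(train_grads)
assert ifvps.shape == (num_train, num_params, num_val)
```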
2. Summary of the change
Added IFAttributorDataInf to dattri.algorithm.influence_function.py.
3. What tests have been added/updated for the change?
Added a test for transform_test_rep (the test-representation transformation) in test/algorithm/test_influence_function.py.