AI-secure / Shapley-Study

[CVPR 2021] Scalability vs. Utility: Do We Have to Sacrifice One for the Other in Data Importance Quantification?
https://arxiv.org/abs/1911.07128v2

Shapley-Study

Setup

Use the following command to install the package:

python setup.py install

Overview

We provide 13 datasets, 5 feature extractors, 5 applications, and 6 data valuation measures for a comprehensive evaluation. The supported choices are listed below.

Datasets        mnist, fashionmnist, svhn, cifar, pubfig, tinyimagenet, usps, uci adult
Extractors      VGG11, MobileNet, ResNet18, EfficientNet, Inception-V3
Applications    Noisy label detection, Watermark removal, Data summarization, Active data acquisition, Domain adaptation
Measures        KNN-Shapley, TMC-Shapley, G-Shapley, LOO, KNN-LOO, Random
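
As an illustration of what a measure computes, here is a minimal sketch of the leave-one-out (LOO) idea: a point's value is the drop in utility when that point is removed from the full training set. This is not the package's implementation. The function name leave_one_out, the utility callable, and the toy additive utility are assumptions made for this example.

```python
from typing import Callable

import numpy as np


def leave_one_out(n: int, utility: Callable[[np.ndarray], float]) -> np.ndarray:
    """Return LOO values for n training points.

    `utility` (hypothetical interface) maps an array of training indices
    to a score, e.g. the test accuracy of a model trained on that subset.
    """
    full = np.arange(n)
    base = utility(full)  # utility of the complete training set
    values = np.empty(n)
    for i in range(n):
        # Value of point i = utility with i minus utility without i.
        values[i] = base - utility(np.delete(full, i))
    return values


# Toy utility (assumption): each point contributes a fixed amount,
# so the LOO value of a point is exactly its own contribution.
weights = np.array([0.5, -0.2, 0.1])
loo_values = leave_one_out(3, lambda idx: float(weights[idx].sum()))
print(loo_values)
```

With a real model, each call to the utility means retraining on the reduced set, which is why LOO needs one training run per data point.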

The customized datasets, including those with injected watermarks, and the other preprocessed datasets used in the code are available on Google Drive. For testing, we recommend downloading the folder Shapley_data and placing it under the root folder (the same folder as samples.ipynb).

Usage

Step 1. Apply an extractor to a dataset to extract its embeddings. This is implemented in the form of extract_embeddings(extractor, dataset), e.g.

python -m shapley.embedding.extract_embeddings --extractor resnet18 --dataset mnist

Step 2. Use the extracted embeddings together with a measure in an application, implemented in the form of

measure = ...
app = ...
app.run(measure)

See samples.ipynb for sample test cases.

Changelog

2020.12.27

Add the PyTorch implementation of the KNN-Shapley calculation in shapley/measures/KNN_Shapley.py. The PyTorch implementation runs faster than the original NumPy implementation because the operations are parallelized.

One standalone experiment can be found in samples.ipynb.
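
For readers unfamiliar with the measure, the following is a minimal NumPy sketch of the exact KNN-Shapley recursion for an unweighted KNN classifier, as it is commonly stated in the literature. It is not the code in shapley/measures/KNN_Shapley.py. The function name knn_shapley, its arguments, and the toy data are assumptions for illustration only.

```python
import numpy as np


def knn_shapley(x_train, y_train, x_test, y_test, k=5):
    """Average exact KNN-Shapley values over all test points.

    Assumes an unweighted KNN classifier whose utility on one test point
    is the fraction of its k nearest neighbours carrying the correct label.
    """
    n = len(x_train)
    values = np.zeros(n)
    for x, y in zip(x_test, y_test):
        # Rank training points from nearest to farthest for this test point.
        dist = np.linalg.norm(x_train - x, axis=1)
        order = np.argsort(dist, kind="stable")
        match = (y_train[order] == y).astype(float)

        s = np.zeros(n)
        s[n - 1] = match[n - 1] / n  # farthest point starts the recursion
        for i in range(n - 2, -1, -1):
            rank = i + 1  # 1-based rank of the point at sorted position i
            s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank
        values[order] += s  # map back from sorted order to original indices
    return values / len(x_test)


# Toy data (assumption): three 1-D training points and one test point.
x_train = np.array([[0.0], [1.0], [3.0]])
y_train = np.array([1, 0, 1])
x_test = np.array([[0.1]])
y_test = np.array([1])

values_k1 = knn_shapley(x_train, y_train, x_test, y_test, k=1)
values_k2 = knn_shapley(x_train, y_train, x_test, y_test, k=2)
print(values_k1, values_k2)
```

The inner loop is sequential per test point, but test points are independent of one another, so the distance computation and sorting can be batched across them.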