SonyResearch / IDEAL

Query-Efficient Data-Free Learning from Black-Box Models

Question regarding the method of counting queries #1

Open votrinhan88 opened 1 year ago

votrinhan88 commented 1 year ago

Issue Description

I've noticed an issue with how queries to the black-box model are counted: the reported number appears to be lower than the actual number of queries made. Specifically, queries are issued in ideal.py -> kd_train -> cal_label, so the counting should be performed every time the black-box model is called.

  1. Based on my understanding, the current implementation reports the size of the data pool as the query count. However, since the data pool accumulates over time through synthesizer.gen_data(), the actual number of queries also accumulates across epochs.
  2. Additionally, the loaded generated images undergo augmentation in kd_train -> synthesizer.get_data() -> datasets = self.data_pool.get_dataset(transform=self.transform, ...). Since the black-box model's predictions for the original image and each augmented version may differ, each augmented copy should be counted as a separate query.

If I have misunderstood something, it would be great if you could elaborate on how the queries are counted. Thanks in advance.

Expected vs Current Behavior

Considering a batch size of $B$ and a number of training epochs $E$:

The reported vs. expected numbers of queries across the datasets would be:

| Dataset | $B$ | $E$ | Reported $Q$ | Expected $Q$ |
| --- | --- | --- | --- | --- |
| MNIST | 250 | 100 | 25K | 1.26M |
| FMNIST, SVHN | 250 | 400 | 100K | 20.05M |
| CIFAR10, ImageNet Subset | 250 | 1000 | 250K | 125.1M |
| CIFAR100, TinyImageNet | 1000 | 2000 | 2M | 2B |
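
If I am reading the training loop correctly, these estimates follow from the pool growing by roughly $B$ images per epoch while the entire pool is re-labelled by the teacher every epoch, whereas the reported count only grows by $B$ per epoch:

$$Q_{\text{reported}} = B \cdot E, \qquad Q_{\text{expected}} \approx B \cdot \frac{E(E+1)}{2}$$

For example, MNIST gives $250 \cdot \tfrac{100 \cdot 101}{2} = 1{,}262{,}500 \approx 1.26\text{M}$.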

Steps to Reproduce

To count the queries, I used a counter variable and incremented it every time ideal.py -> kd_train -> cal_label is called.

Specifically, in ideal.py:

```python
...
def kd_train(synthesizer, model, optimizer, criterion, query):  # add a 'query' parameter
    ...
    label = cal_label(blackBox_net, images)
    query.add_(images.shape[0])  # increment the counter right after calling the black-box model
    ...
...
if __name__ == '__main__':
    query = torch.tensor(0, dtype=torch.int)  # initialize a tensor to store the number of queries
    ...
    for epoch in tqdm(range(args.epochs)):
        ...
        cls_list_counter = kd_train(synthesizer, [sub_net, blackBox_net], optimizer, criterion, query=query)  # pass the counter into the distillation step
        ...
        print_log("Dataset:{}, Epoch: {}, Accuracy of the substitute model:{:.3} %, best accuracy:{:.3} %, query {} \n".format(
            args.dataset, epoch, acc, best_acc, query.item()), log)  # log the query count
        ...
```
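
For what it's worth, an alternative that avoids threading the counter through kd_train is to wrap the black-box model itself, so every forward pass (including augmented copies) is counted automatically. This is only a sketch; QueryCounter is not part of the repository:

```python
import torch.nn as nn

class QueryCounter(nn.Module):
    """Counts every image sent through the wrapped black-box model."""
    def __init__(self, blackbox):
        super().__init__()
        self.blackbox = blackbox
        self.queries = 0

    def forward(self, x):
        self.queries += x.shape[0]  # each (possibly augmented) image is one query
        return self.blackbox(x)

# usage: blackBox_net = QueryCounter(blackBox_net); read blackBox_net.queries when logging
```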

Possible Solution

  1. Add the counter as suggested in the 'Steps to Reproduce' section.
  2. Alternatively, save the generated images together with their teacher labels to the data pool, so the teacher is not queried again when images are loaded from the pool (see the sketch after this list).
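
A minimal sketch of the second option, assuming the pool stores (image, label) pairs; LabeledPool and its methods are hypothetical and not the repository's actual DataPool API:

```python
import torch
from torch.utils.data import TensorDataset

class LabeledPool:
    """Caches the teacher's label with each synthetic image, so the
    black-box model is queried exactly once per generated image."""
    def __init__(self):
        self.images, self.labels = [], []

    def add(self, images, blackBox_net, query):
        with torch.no_grad():
            labels = blackBox_net(images).argmax(dim=1)  # the only teacher query for these images
        query.add_(images.shape[0])                      # count it here, once
        self.images.append(images.cpu())
        self.labels.append(labels.cpu())

    def get_dataset(self):
        # Reuse the cached labels; loading from the pool requires no new queries.
        return TensorDataset(torch.cat(self.images), torch.cat(self.labels))
```

Note that reusing cached labels means an augmented copy inherits the label of its original image, which slightly changes the training signal but keeps the query budget at one query per generated image.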
892446631 commented 6 months ago

I agree with your view; the number of queries far exceeds what was reported in the paper. Essentially the same paper (by the same authors) was already published at CVPR 2022: there is almost no difference between the two papers, and the code released for the CVPR 2022 paper is the same as the code here, with the same issues.
Paper: Towards Efficient Data Free Black-box Adversarial Attack
GitHub: https://github.com/zj-jayzhang/Data-Free-Transfer-Attack