YiZeng623 / Advanced-Gradient-Obfuscating

Take further steps in the arms race of adversarial examples with only preprocessing.
https://arxiv.org/abs/2005.13712
MIT License

Issue upon the design of adaptive attack #2

Open wangwwno1 opened 4 years ago

wangwwno1 commented 4 years ago

Hello, I just read the paper, and it's quite a brilliant idea to apply an input transformation defence to circumvent existing advanced gradient attacks without retraining the model or degrading performance.

Here are some thoughts on adaptive attacks that may make this work more persuasive:

  1. Zero-Order Attacks that require no gradients (e.g. SPSA/ZOO); see the sketch after this list.

  2. Feature-Based Attacks that may compromise the Second Property. Edit: I made a mistake here - the dimension of x does not necessarily match that of f'(g(x)), but it's still possible to train a differentiable surrogate model h(x) s.t. h(x) ≈ f'(g(x)).

  3. Surrogate Model Attacks. Since the adversary has full knowledge of the model and the defence mechanism (except the random numbers), and the protected model is unchanged, the distorted image may share a similar internal representation with the original image. That means the adversary may find an intermediate layer in the original model s.t. f'(g(x)) ≈ f'(x), where f'() is a sub-model consisting of the original model's layers from the first (input) layer up to a designated layer before the output.

    The adversary then trains an ensemble of differentiable (possibly probabilistic, to address the randomization) surrogate models h(x) s.t. f'(g(x)) ≈ h(x), and applies a white-box attack (or BPDA?) to obtain adversarial examples. A surrogate model can be trained directly as h(x) = f(g(x)), where f() is the full original model, but to reduce the training burden, it's recommended to find an intermediate layer instead.
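For the first point, a rough sketch of an SPSA-style gradient estimate could look like this. PyTorch is used purely for illustration, and `defended_loss` is an assumed black-box wrapper around the full pipeline f(g(x)), not code from this repo:

```python
import torch

def spsa_gradient(defended_loss, x, n_samples=128, delta=0.01):
    """Two-sided SPSA estimate of d loss / d x for a black-box pipeline.

    defended_loss: callable mapping a batch of inputs to per-example losses
                   of the full defended model f(g(x)); only forward queries
                   are needed, so non-differentiable preprocessing is fine.
    """
    grad = torch.zeros_like(x)
    for _ in range(n_samples):
        # Random Rademacher direction (+1 / -1 per pixel).
        v = torch.randint_like(x, 0, 2) * 2 - 1
        # Two-sided finite difference along v.
        diff = defended_loss(x + delta * v) - defended_loss(x - delta * v)
        # Broadcast the per-example difference over the input shape.
        grad += diff.view(-1, *([1] * (x.dim() - 1))) * v / (2 * delta)
    return grad / n_samples

# One signed ascent step with the estimated gradient (eps is illustrative):
# x_adv = (x + 0.03 * spsa_gradient(defended_loss, x).sign()).clamp(0, 1)
```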

Stanislas0 commented 4 years ago

We're glad you have paid attention to our work and found it a brilliant idea.

For your suggestions:

  1. We will include some gradient-free attacks, as you mentioned, in our future work.
  2. It could be a possible way to compromise Property #2. It would be interesting to find out whether such an intermediate layer exists, and we will do some experiments to test this assumption. However, after finding this layer and generating the adversarial perturbations on it, how to add them to the input remains a question. By the way, could you please provide some references for the Feature-Based Attack, so that we can better perform this kind of attack?
wangwwno1 commented 4 years ago

Hmm, it looks like the Feature-Based Attack mentioned in my comment is different from the original definition, but it's worthwhile to take both (the aforementioned adaptive attack and the Feature-Based Attack) into consideration.

The original Feature-Based Attack paper: Adversarial Manipulation of Deep Representations (ICLR 2016).

The key concept is to match the internal representation of the adversarial example with that of another known, desired (benign or successfully adversarial) example.

There have been multiple works since then; unfortunately I'm not very familiar with them, but this paper can serve as a good starting point.
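To make that concrete, a rough sketch of such a representation-matching attack (in the spirit of that paper) might look like the following. Here `feature_extractor` stands for a differentiable f' up to some intermediate layer; PyTorch and all names are purely illustrative, not code from this repo:

```python
import torch
import torch.nn.functional as F

def feature_matching_attack(feature_extractor, x, x_guide, eps=0.03, steps=100, lr=0.01):
    """Perturb x within an L_inf ball of radius eps so that its internal
    representation approaches that of a guide example x_guide."""
    target = feature_extractor(x_guide).detach()
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        # Drive the representation of x + delta toward the guide's.
        loss = F.mse_loss(feature_extractor(x + delta), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Project back into the L_inf ball and the valid pixel range.
        with torch.no_grad():
            delta.clamp_(-eps, eps)
            delta.copy_((x + delta).clamp(0, 1) - x)
    return (x + delta).detach()
```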

wangwwno1 commented 4 years ago

> However, after finding this layer and generating the adversarial perturbations on it, how to add them to the input remains a question.

Sorry, I didn't make it clear: the adversarial perturbation is applied to the input, not to the internal representation. We simply replace the sub-model f'(g(x)) with x in the backward propagation, just like BPDA does.
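To be concrete, the BPDA-style gradient replacement could look roughly like this, a PyTorch sketch under the assumption that the preprocessing g preserves the input shape; all names are illustrative, not from this repo:

```python
import torch

class BPDAIdentity(torch.autograd.Function):
    """Forward: run the real (possibly non-differentiable) preprocessing.
    Backward: approximate d g(x)/d x by the identity, so gradients flow
    straight through to the input."""

    @staticmethod
    def forward(ctx, x, preprocess):
        return preprocess(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Gradient w.r.t. x is passed through unchanged; the callable
        # `preprocess` receives no gradient.
        return grad_output, None

# Usage in a white-box attack loop (g and classifier are assumed handles):
# logits = classifier(BPDAIdentity.apply(x_adv, g))
```

Note that this identity substitution only makes sense when g's output has the same shape as its input.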

wangwwno1 commented 4 years ago

BTW, another good survey paper for checking defence mechanisms: On Adaptive Attacks to Adversarial Example Defenses.

Stanislas0 commented 4 years ago

> However, after finding this layer and generating the adversarial perturbations on it, how to add them to the input remains a question.

> Sorry, I didn't make it clear: the adversarial perturbation is applied to the input, not to the internal representation. We simply replace the sub-model f'(g(x)) with x in the backward propagation, just like BPDA does.

However, the output of the sub-model f'(g(x)) doesn't necessarily have the same dimensions as x, so the gradients calculated on f'(g(x)) can't be applied directly to change the input x?

wangwwno1 commented 4 years ago

> However, after finding this layer and generating the adversarial perturbations on it, how to add them to the input remains a question.

> Sorry, I didn't make it clear: the adversarial perturbation is applied to the input, not to the internal representation. We simply replace the sub-model f'(g(x)) with x in the backward propagation, just like BPDA does.

> However, the output of the sub-model f'(g(x)) doesn't necessarily have the same dimensions as x, so the gradients calculated on f'(g(x)) can't be applied directly to change the input x?

Hmm, you are right, I made a big mistake: the dimension of f'(g(x)) does not match that of the input x.

My initial intuition was that, since the protected model is unchanged, the distorted image may share a similar internal representation with the original image, hence f'(g(x)) ≈ f'(x) regardless of the randomization.

If we can find another differentiable function h(x) to replace f'(g(x)), then we can optimize directly on h(x) to get the desired adversarial example. The original idea had h(x) = x, but since the dimension of f'(g(x)) does not match that of x, we need another way around this.

Edit: h(x) is somewhat like a surrogate model of f'(g(x)), but to reduce the training burden, it's recommended to use an intermediate layer instead of taking the whole original model.
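To spell out the whole pipeline, a rough sketch could be: train a differentiable surrogate h to match the defended intermediate representation f'(g(x)), then attack through the composition tail(h(x)). All of `h`, `sub_model`, `tail_model`, and `g` are assumed handles, PyTorch is used only for illustration, and none of this is code from the repo:

```python
import torch
import torch.nn.functional as F

def train_surrogate(h, sub_model, g, loader, epochs=5, lr=1e-3):
    """Fit a differentiable surrogate h(x) ~= f'(g(x)), where sub_model is
    the protected model truncated at a chosen intermediate layer and g is
    the randomized preprocessing defence."""
    opt = torch.optim.Adam(h.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:
            with torch.no_grad():
                # Resample the randomized defence every step so h learns the
                # expected representation over the randomness.
                target = sub_model(g(x))
            loss = F.mse_loss(h(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return h

def pgd_through_surrogate(h, tail_model, x, y, eps=0.03, alpha=0.007, steps=40):
    """PGD on tail_model(h(x)), a differentiable stand-in for f(g(x))."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(tail_model(h(x_adv)), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            # Signed ascent step, then project into the L_inf ball around x.
            x_adv = (x_adv + alpha * grad.sign()).clamp(x - eps, x + eps).clamp(0, 1)
    return x_adv.detach()
```

Whether the adversarial examples found through h actually transfer back to the full defended pipeline f(g(x)) is exactly what the proposed experiments would need to verify.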