facebookresearch / text-adversarial-attack

Repo for arXiv preprint "Gradient-based Adversarial Attacks against Text Transformers"
Other

Gumbel Softmax and Softmax #1

Closed SolidShen closed 3 years ago

SolidShen commented 3 years ago

Hi authors: This is great work, and I have some questions about the Gumbel-softmax and the plain softmax. In my opinion, you are trying to optimize an adversarial discrete distribution, and the Gumbel-softmax allows you to differentiably draw samples from that distribution. My question is: if we just want to optimize a single adversarial example instead of a distribution, can we directly use the softmax as the approximation? If yes, what are the main benefits of optimizing a distribution instead of a single example? If not, why?

Thanks! Best, Guangyu

cg563 commented 3 years ago

Hi Guangyu, I'm not sure how you are proposing to use the softmax to define a single adversarial example. The input to the transformer is a sequence of tokens; how do you parameterize the softmax so that its output is a deterministic sequence of tokens?

SolidShen commented 3 years ago

> Hi Guangyu, I'm not sure how you are proposing to use the softmax to define a single adversarial example. The input to the transformer is a sequence of tokens; how do you parameterize the softmax so that its output is a deterministic sequence of tokens?

Hi, Say I want to find an adversarial token such that, after I append it to the benign text input, the model gives a wrong prediction. My point is: can I optimize a [1xN] vector, feed it to the softmax (not the Gumbel-softmax), apply eq. (6) in the paper, and then feed the result to the transformer? In this case, the softmax output can be considered an approximation of the one-hot token encoding, right? Then what's the difference between using the softmax and the Gumbel-softmax? What's the benefit of the Gumbel-softmax in this case?
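To make the proposal concrete, here is a minimal sketch of that plain-softmax relaxation (the toy embedding matrix and variable names are my own, not from the paper): optimize a logit vector, and feed the softmax-weighted mixture of token embeddings to the model, in the spirit of eq. (6).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 5, 4
E = rng.normal(size=(vocab_size, embed_dim))  # toy token embedding matrix (assumption)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

theta = rng.normal(size=vocab_size)  # the [1 x N] logit vector to be optimized
pi = softmax(theta)                  # relaxed "token": a distribution over the vocabulary
e_soft = pi @ E                      # soft embedding fed to the transformer, as in eq. (6)

# e_soft is differentiable in theta, so gradients flow end-to-end;
# note, however, that nothing here forces pi toward a one-hot vector.
print(pi.sum(), e_soft.shape)
```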

cg563 commented 3 years ago

Ah, I see what you mean. What you're proposing works during optimization, but you still have to convert the result to a discrete token before feeding it to the target classifier. If you do something like taking the argmax of the probability vector, it doesn't work, because there's nothing stopping the optimizer from settling on the centroid of all token embeddings. Once you discretize that with argmax, it becomes an arbitrary token.
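A toy illustration of that failure mode (hypothetical numbers, not from the repo): with near-uniform softmax weights, the mixed embedding the optimizer sees is close to the centroid of all token embeddings, yet argmax still commits to one essentially arbitrary token whose embedding is far from what was optimized.

```python
import numpy as np

rng = np.random.default_rng(1)
E = rng.normal(size=(5, 4))  # toy embedding matrix (assumption)

theta = np.array([0.01, 0.0, 0.02, -0.01, 0.0])  # near-uniform logits
pi = np.exp(theta) / np.exp(theta).sum()

e_soft = pi @ E              # what the optimizer "sees": roughly the centroid
centroid = E.mean(axis=0)
token = int(np.argmax(pi))   # what the classifier actually receives

# the soft embedding sits very close to the centroid of all tokens...
print(np.linalg.norm(e_soft - centroid))
# ...but the argmax token's embedding is far from the optimized point
print(np.linalg.norm(E[token] - e_soft))
```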

The Gumbel-softmax takes care of this because there's randomness during training as well. A Gumbel-softmax distribution with large entropy can generate pretty random sentences, which doesn't help reduce the adversarial loss. The optimizer has to eventually pick a token distribution with low entropy so that samples from the Gumbel-softmax consistently fool the classifier.
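For contrast, a minimal Gumbel-softmax sampler (the standard formulation; the temperature name `tau` and the toy logits are my own): each draw adds fresh Gumbel noise, so a high-entropy logit vector yields different tokens across samples, while only a low-entropy one produces the same token consistently.

```python
import numpy as np

rng = np.random.default_rng(2)

def gumbel_softmax_sample(theta, tau=1.0):
    # Gumbel(0,1) noise makes sampling from softmax(theta) reparameterizable,
    # hence differentiable with respect to theta
    g = -np.log(-np.log(rng.uniform(size=theta.shape)))
    z = np.exp((theta + g) / tau - ((theta + g) / tau).max())
    return z / z.sum()

high_entropy = np.zeros(5)                  # uniform over 5 tokens
low_entropy = np.array([10.0, 0, 0, 0, 0])  # mass concentrated on token 0

# high-entropy theta: draws land on varying tokens, so the adversarial loss is
# noisy; low-entropy theta: draws consistently select token 0
print([int(np.argmax(gumbel_softmax_sample(high_entropy))) for _ in range(5)])
print([int(np.argmax(gumbel_softmax_sample(low_entropy))) for _ in range(5)])
```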

This is actually a very good question and we didn't talk about this aspect of the Gumbel-softmax in the paper. We will add it in a revision, thanks!