Closed: shanefeng123 closed this issue 1 year ago
Hi Shane,
Thanks for raising this issue.
Our attack only operates on the gradients and the model weights, and is agnostic to the fine-tuning stage of the model. This means that you can even run our attack on a pre-trained model directly, executing it during its very first fine-tuning step (i.e., a language model fine-tuned on the datasets is not a requirement for carrying out the attack).
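For concreteness, here is a minimal sketch of that first-step setting (the model size and batch text are placeholders, and this only shows the gradient computation the attack consumes, not the attack itself):
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
# pre-trained model straight from HuggingFace, with no fine-tuning on the target data
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
# one forward/backward pass on a (placeholder) private batch
batch = tokenizer("a placeholder private sentence", return_tensors='pt')
model(**batch, labels=batch["input_ids"]).loss.backward()
# the attack only needs these per-parameter gradients plus the model weights
grads = {name: p.grad.detach().clone() for name, p in model.named_parameters() if p.grad is not None}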
However, as we noted in our paper (Figure 4 on page 8), our attack becomes stronger towards the later stage of fine-tuning.
Let me know if you have further questions! Otherwise, please feel free to close the issue if you consider it resolved.
Best, Yangsibo
Hi Yangsibo,
Thanks for your reply. I understand the attack can work better when the model is more fine-tuned, as the probability distribution is more accurate.
Can I also ask about your approach to recovering the bag of words as the first step of the attack? In your paper, you mention that you follow the method by Melis, which extracts the tokens that have non-zero gradients in the token embedding layer. However, in GPT-2 the token embedding layer is tied to the final linear layer, so all tokens end up with some gradient. How do you extract the bag of words in this case? Do you use a cutoff value on the gradient norm?
Best, Shane
Did you solve this problem? I am also confused about how to recover the bag of words for the GPT-2 model.
Hello,
The norms of the rows in the word embeddings for tokens in the BoW are much larger than the others.
See the demo code below for a short example
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input, labels=encoded_input["input_ids"])
output.loss.backward()
# print norms of gradients of embeddings
torch.set_printoptions(profile="full")
print(model.get_input_embeddings().weight.grad.norm(dim=1)[encoded_input["input_ids"]])
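If you want a concrete rule rather than eyeballing the printout, one option (just an illustration, not the exact procedure from the paper; the cutoff is a hyperparameter you would have to tune) is to threshold the per-row gradient norms, since even with tied embeddings the rows for tokens that appear in the batch have much larger norms:
# continuing from the demo above: select rows whose gradient norm is large
norms = model.get_input_embeddings().weight.grad.norm(dim=1)
cutoff = 0.01 * norms.max()  # hypothetical relative cutoff; tune for your model and batch size
bag_of_words = (norms > cutoff).nonzero().squeeze(-1)
print(tokenizer.convert_ids_to_tokens(bag_of_words.tolist()))
Alternatively, if you have a good guess of the number of tokens in the batch, taking the top-k rows by norm avoids having to pick a threshold.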
Hi,
Thanks for providing the implementation of your excellent paper.
In your repo description, I see you said:
"We currently do not provide any models fine-tuned on the datasets (will be added to the repository at a later data).
For now, you may use pre-trained models from HuggingFace and fine-tune on the provided datasets.".
Do we need to have a language model fine-tuned on the datasets to be able to perform the attack? Which part uses it? I thought you only need a language model trained with the batch to perform the beam search?
Thanks, Shane