haichao592 opened this issue 2 years ago
The task-specific linear head is fine-tuned together with the prompt embeddings. A comparison of the LM head and the task-specific linear head is provided in our experiments (Table 5), which shows that in a data-rich setting an LM head's performance is not better than a co-tuned task-specific head's.
Thanks! First, my concern is only about extractive question answering tasks, e.g., SQuAD. As I understand it, the LM head currently cannot be used to produce outputs with RoBERTa on SQuAD, so Table 5 is not applicable. Is that right?
Second, I have tried PT with T5-base-v1.1 as in Lester et al. (2021) and with RoBERTa-base as described above (fine-tuning both the prompt embeddings (input layer only) and the task-specific QA head). The F1 scores easily exceed 80 without a careful hyperparameter search, while the results in Table 3 are quite different. Are there any other constraints that need to be met in the implementation of PT?
Yes, the LM head cannot be applied to sequence tagging for now. Your observation on PT with SQuAD is quite interesting. Have you frozen the pre-trained model's parameters? If so, could you please share your implementation with us for reference? I am also curious about why our results for PT on QA are so low.
Yes, I am sure. Only the prompt embeddings and the QA head are added to the optimizer.
I think these small code snippets are enough, since it is easy to implement.
import torch
from torch.nn import Module, Parameter, init

class PromptEmbedding(Module):
    def __init__(self, num_embeddings, embedding_dim):
        super().__init__()
        self.num_embeddings = num_embeddings
        self.embedding_dim = embedding_dim
        self.weight = Parameter(torch.empty(num_embeddings, embedding_dim))
        self.reset_parameters()

    def reset_parameters(self, weight=None):
        # Optionally initialize the prompts from existing embeddings.
        if weight is not None:
            self.weight.data = weight.clone().detach()
        else:
            init.normal_(self.weight, mean=0, std=1.0)

    def forward(self, input):
        # Prepend the prompt vectors to the input embeddings along the sequence dimension.
        return torch.cat([self.weight.repeat(input.size(0), 1, 1), input], dim=1)
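For what it's worth, here is a quick shape check of the module above (the prompt length of 20 and RoBERTa-base's hidden size of 768 are just illustrative values):

# Illustrative shape check: 20 trainable prompt vectors prepended to a batch
# of token embeddings of shape (batch, seq_len, hidden).
prompt = PromptEmbedding(num_embeddings=20, embedding_dim=768)
token_embeds = torch.randn(4, 128, 768)  # stand-in for inputs_embeds
print(prompt(token_embeds).shape)        # torch.Size([4, 148, 768])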
In RobertaEmbeddings,
embeddings = inputs_embeds + token_type_embeddings
if self.position_embedding_type == "absolute":
    position_embeddings = self.position_embeddings(position_ids)
    embeddings += position_embeddings
if hasattr(self, "prompt_embeddings"):
    # Prepend the trainable prompts before LayerNorm and dropout.
    embeddings = self.prompt_embeddings(embeddings)
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
return embeddings
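Not shown in the snippets is how prompt_embeddings gets attached to the model and how num_prompts / prompt_tuning end up on the config (they are custom fields, not standard transformers options), so the wiring below is only my guess at a minimal setup:

# Hypothetical wiring (my assumption, not part of the snippets above).
from transformers import RobertaForQuestionAnswering

model = RobertaForQuestionAnswering.from_pretrained("roberta-base")
model.config.num_prompts = 20        # custom config fields read by the snippets
model.config.prompt_tuning = True
model.roberta.embeddings.prompt_embeddings = PromptEmbedding(
    model.config.num_prompts, model.config.hidden_size
)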
Attention Mask,
if self.config.num_prompts > 0 and self.config.prompt_tuning:
    # Extend the attention mask so the prepended prompt positions are attended to.
    attention_mask = torch.cat(
        [
            torch.ones((attention_mask.size(0), self.config.num_prompts), device=device, dtype=attention_mask.dtype),
            attention_mask,
        ],
        dim=1,
    )
RobertaForQuestionAnswering outputs,
sequence_output = outputs[0]
if self.config.num_prompts > 0:
    # Drop the prompt positions before predicting answer spans.
    sequence_output = sequence_output[:, self.config.num_prompts:]
logits = self.qa_outputs(sequence_output)
start_logits, end_logits = logits.split(1, dim=-1)
start_logits = start_logits.squeeze(-1).contiguous()
end_logits = end_logits.squeeze(-1).contiguous()
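To make the "only the prompt embeddings and the QA head are in the optimizer" part concrete, here is a minimal sketch assuming the wiring above; the learning rate is a placeholder, not the value actually used for the reported scores:

# Freeze the backbone; train only the prompt embeddings and the QA head.
for param in model.parameters():
    param.requires_grad = False
for param in model.roberta.embeddings.prompt_embeddings.parameters():
    param.requires_grad = True
for param in model.qa_outputs.parameters():
    param.requires_grad = True

# Placeholder learning rate, not the one behind the F1 scores mentioned above.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)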
@Xiao9905 Hi, could you share the hyperparameters and optimizer configuration used for PT2 on SQuAD 1.1 with RoBERTa-large, such as the learning rate, prompt length, epochs or max steps, warmup rate, weight decay, optimizer, initialization, and so on? Thanks!
In Lester et al. (2021), they use T5 as the pre-trained model and use the LM head to generate answers. For models like BERT and RoBERTa explored in this work, we cannot use the LM head to extract context spans as the answers, which means a linear QA head is essential. Is the task-specific linear head fine-tuned with the prompt embeddings in PT (Table 3)? If so, this implementation is a little different from the original one. If not, the randomly initialized QA head is not expected to produce meaningful outputs and would hinder PT training, which makes the PT results in Table 3 meaningless.
Or do I have some misunderstanding about the LM head in QA tasks?