haichao592 opened this issue 2 years ago
The task-specific linear head is fine-tuned together with the prompt embeddings. A comparison of the LM head and the task-specific linear head is provided in our experiments (Table 5), which shows that in a data-rich setting an LM head's performance is not better than a co-tuned task-specific head's.
Thanks! First, my concern is only about extractive question answering tasks, e.g., SQuAD. As I understand it, the LM head currently cannot be used to produce outputs with RoBERTa on SQuAD, so Table 5 is not applicable. Is that right?
Second, I have tried PT with T5-base-v1.1 as in Lester et al. (2021) and with RoBERTa-base as described above (fine-tuning both the prompt embeddings (input layer only) and the task-specific QA head). The F1 scores easily exceed 80 without a careful hyperparameter search, while the results in Table 3 are quite different. Are there any other constraints that need to be met in the implementation of PT?
Yes, the LM head cannot be applied to sequence tagging for now. Your observation on PT with SQuAD is quite interesting. Have you frozen the pre-trained model's parameters? If so, could you please share your implementation with us for reference? I am also curious about why our results for PT on QA are so low.
Yes, I am sure. Only the prompt embeddings and the QA head are added to the optimizer.
I think these small code snippets are enough, since it is easy to implement.
import torch
from torch.nn import Module, Parameter, init

class PromptEmbedding(Module):
    def __init__(self, num_embeddings, embedding_dim):
        super().__init__()
        self.num_embeddings = num_embeddings
        self.embedding_dim = embedding_dim
        self.weight = Parameter(torch.empty(num_embeddings, embedding_dim))
        self.reset_parameters()

    def reset_parameters(self, weight=None):
        # Optionally initialize the prompts from existing embeddings.
        if weight is not None:
            self.weight.data = weight.clone().detach()
        else:
            init.normal_(self.weight, mean=0, std=1.0)

    def forward(self, input):
        # Prepend the prompt vectors to the input embeddings along the sequence dimension.
        return torch.cat([self.weight.repeat(input.size(0), 1, 1), input], dim=1)
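For what it's worth, here is a quick shape check of the module above (the prompt length of 20 and RoBERTa-base's hidden size of 768 are just illustrative values):

# Illustrative shape check: 20 trainable prompt vectors prepended to a batch
# of token embeddings of shape (batch, seq_len, hidden).
prompt = PromptEmbedding(num_embeddings=20, embedding_dim=768)
token_embeds = torch.randn(4, 128, 768)  # stand-in for inputs_embeds
print(prompt(token_embeds).shape)        # torch.Size([4, 148, 768])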
In RobertaEmbeddings,
embeddings = inputs_embeds + token_type_embeddings
if self.position_embedding_type == "absolute":
    position_embeddings = self.position_embeddings(position_ids)
    embeddings += position_embeddings
if hasattr(self, "prompt_embeddings"):
    # Prepend the trainable prompts before LayerNorm and dropout.
    embeddings = self.prompt_embeddings(embeddings)
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
return embeddings
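Not shown in the snippets is how prompt_embeddings gets attached to the model and how num_prompts / prompt_tuning end up on the config (they are custom fields, not standard transformers options), so the wiring below is only my guess at a minimal setup:

# Hypothetical wiring (my assumption, not part of the snippets above).
from transformers import RobertaForQuestionAnswering

model = RobertaForQuestionAnswering.from_pretrained("roberta-base")
model.config.num_prompts = 20        # custom config fields read by the snippets
model.config.prompt_tuning = True
model.roberta.embeddings.prompt_embeddings = PromptEmbedding(
    model.config.num_prompts, model.config.hidden_size
)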
Attention Mask,
if self.config.num_prompts > 0 and self.config.prompt_tuning:
    # Extend the attention mask so the prepended prompt positions are attended to.
    attention_mask = torch.cat(
        [
            torch.ones((attention_mask.size(0), self.config.num_prompts), device=device, dtype=attention_mask.dtype),
            attention_mask,
        ],
        dim=1,
    )
RobertaForQuestionAnswering outputs,
sequence_output = outputs[0]
if self.config.num_prompts > 0:
    # Drop the prompt positions before predicting answer spans.
    sequence_output = sequence_output[:, self.config.num_prompts:]
logits = self.qa_outputs(sequence_output)
start_logits, end_logits = logits.split(1, dim=-1)
start_logits = start_logits.squeeze(-1).contiguous()
end_logits = end_logits.squeeze(-1).contiguous()
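To make the "only the prompt embeddings and the QA head are in the optimizer" part concrete, here is a minimal sketch assuming the wiring above; the learning rate is a placeholder, not the value actually used for the reported scores:

# Freeze the backbone; train only the prompt embeddings and the QA head.
for param in model.parameters():
    param.requires_grad = False
for param in model.roberta.embeddings.prompt_embeddings.parameters():
    param.requires_grad = True
for param in model.qa_outputs.parameters():
    param.requires_grad = True

# Placeholder learning rate, not the one behind the F1 scores mentioned above.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)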
@Xiao9905 Hi, could you share the hyperparameters and optimizer configuration used for PT2 on SQuAD 1.1 with RoBERTa-large, such as the learning rate, prompt length, epochs or max steps, warmup rate, weight decay, optimizer, initialization, and so on? Thanks!
In Lester et al. (2021), they use T5 as the pre-trained model and use the LM head to generate answers. For models like BERT and RoBERTa explored in this work, we cannot use the LM head to extract context spans as the answers, which means a linear QA head is essential. Is the task-specific linear head fine-tuned with the prompt embeddings in PT (Table 3)? If so, this implementation is a little different from the original one. If not, the randomly initialized QA head is not expected to produce meaningful outputs and would hinder PT training, which makes the PT results in Table 3 meaningless.
Or do I have some misunderstanding about the LM head in QA tasks?