Open BillXuce opened 3 years ago
Unfortunately there are a few places in the repository where the code conflicts with the paper. I'm assuming that when the authors say they "dropped" the query portion, they mean that the start/end label masks are applied when the loss is calculated, so the query positions never contribute to it. I don't know for sure; it's just my assumption.
According to Section 3.3.1 of the paper, the input to BERT consists of the query and the context, whose combined length should be
seq_len = n + m + 2
and the output drops the representations of the query and the special tokens. However, in `bert_query_ner.py` (lines 44 and 45), the shape of `sequence_heatmap` coming from the BERT output is `[batch_size, seq_len, hidden_size]`, which conflicts with the paper. So which method should be applied, and how much difference is there between the two in performance?
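If the masking interpretation above is right, the two approaches can be equivalent: keeping the full `[batch_size, seq_len, hidden_size]` tensor but zeroing out the query/special-token positions in the loss gives the same gradient signal as slicing them off. Below is a minimal sketch of that idea; the variable names (`start_label_mask`, the assumed query length of 4 positions) are illustrative, not taken from the repo.

```python
import torch
import torch.nn as nn

# Sketch, not the repo's actual code: sequence_heatmap from BERT keeps
# the full length seq_len = n + m + 2 ([CLS] + query + [SEP] + context),
# and the query portion is "dropped" by masking the per-token loss
# rather than by slicing the tensor.

batch_size, seq_len, hidden_size = 2, 10, 8
sequence_heatmap = torch.randn(batch_size, seq_len, hidden_size)

# per-token start logits, shape [batch_size, seq_len]
start_head = nn.Linear(hidden_size, 1)
start_logits = start_head(sequence_heatmap).squeeze(-1)

# dummy binary start labels, shape [batch_size, seq_len]
start_labels = torch.randint(0, 2, (batch_size, seq_len)).float()

# label mask: 1 for context tokens, 0 for query and special tokens
# (here the first 4 positions stand in for [CLS] + query + [SEP])
start_label_mask = torch.ones(batch_size, seq_len)
start_label_mask[:, :4] = 0.0

loss_fn = nn.BCEWithLogitsLoss(reduction="none")
per_token_loss = loss_fn(start_logits, start_labels)  # [batch_size, seq_len]

# average only over unmasked (context) positions
masked_loss = (per_token_loss * start_label_mask).sum() / start_label_mask.sum()
```

With this setup the query representations still flow through BERT (which the paper's Section 3.3.1 wording allows, since the query conditions the context encoding via self-attention), but they get no supervision, so performance should match the "drop the query output" reading up to the extra compute in the output heads.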