Closed: cemilcengiz closed this issue 4 years ago
Yeah, it was an oversight that we didn't mention it in the paper (we'll mention it in the updated version), but we do have an extra projection layer for both the classifier and the LM before the output is fed into classification.
However, these layers are both pre-trained with the rest of the network and are included in the pre-trained checkpoint. So the part about "the only new parameters added during fine-tuning" is correct; it's just not correct to say "output of the Transformer", it's really "output of the Transformer fed through one additional non-linear transformation".
The tanh() thing was done early to try to make it more interpretable but it probably doesn't matter either way.
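Put differently, the effective classification head looks something like the following minimal PyTorch sketch (the class and variable names are illustrative, not the official implementation; `hidden_size` and `num_labels` are assumed hyperparameters):

```python
import torch
import torch.nn as nn


class PooledClassifierHead(nn.Module):
    """Illustrative head: pre-trained pooler followed by a newly initialized classifier."""

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        # Pre-trained together with the rest of BERT and restored from the checkpoint.
        self.pooler_dense = nn.Linear(hidden_size, hidden_size)
        self.pooler_activation = nn.Tanh()
        # The only parameters that are newly added for fine-tuning.
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, sequence_output: torch.Tensor) -> torch.Tensor:
        # sequence_output: (batch, seq_len, hidden_size) from the Transformer encoder.
        cls_hidden = sequence_output[:, 0]  # hidden state of the [CLS] token
        pooled = self.pooler_activation(self.pooler_dense(cls_hidden))
        return self.classifier(pooled)  # logits of shape (batch, num_labels)
```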
Closing due to @sai-prasanna's excellent answer.
Hi, while I was using AllenNLP's BertForClassification model (powered by BERT-base) on the MNLI dataset, I realized it gets slightly better accuracy on the development set than the published results in the official paper. While investigating possible reasons, I noticed that BertForClassification uses the BertPooler class to pool the BERT encoder output before the final layer, i.e. the classifier. The interesting thing is that the forward() method of BertPooler contains a linear layer itself. This means we effectively pass the [CLS] token through a two-layer MLP instead of the single linear projection layer described in the paper. I wonder why BertPooler is implemented this way. If my understanding is correct, our results cannot be directly compared with the official ones. Please correct me if I am wrong. Here is the BertPooler class:
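For reference, it looks roughly like this in the Hugging Face BERT implementation:

```python
import torch.nn as nn


class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # "Pool" the sequence output by taking the hidden state of the first
        # ([CLS]) token and passing it through a dense layer with a tanh activation.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output
```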
Also, you can see the relevant part of the paper (Section 4.1, GLUE).