In the paper, section 3.1 BERT, it is said that - we extract a fixed sized vector via max pooling of the second to last layer. A sentence of N words will hence result in an N ∗ H embedding vector. The closer to the last layer, the more the semantic information carried by the weights (Zeiler et al., 2011); hence our choice of the second to last layer.
In section 3.1 Sentence-BERT - We extract fixed size sentence embeddings using a mean over all the output vectors, similar to the method we used for BERT.
But in the code -
def get_features_from_sentence(batch_sentences, layer=-2):
    """
    extracts the BERT semantic representation
    from a sentence, using an averaged value of
    the `layer`-th layer
    returns a 1-dimensional tensor of size 758
    """
    batch_features = []
    for sentence in batch_sentences:
        tokens = roberta_model.encode(sentence)
        all_layers = roberta_model.extract_features(tokens, return_all_hiddens=True)
        pooling = torch.nn.AvgPool2d((len(tokens), 1))
        sentence_features = pooling(all_layers[layer])
        batch_features.append(sentence_features[0])
    return batch_features
there is only average pooling of the second-to-last layer. There is no max pooling of the second-to-last layer.
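For comparison, what the paper describes (max pooling of the second-to-last layer) would look roughly like the sketch below. This is only an illustration written against the same roberta_model calls used in the quoted function; it is not code taken from the repository, and get_max_pooled_features is a hypothetical name:

def get_max_pooled_features(batch_sentences, layer=-2):
    """
    Hypothetical variant following the paper's wording:
    element-wise max pooling over the tokens of the `layer`-th layer
    instead of average pooling.
    """
    batch_features = []
    for sentence in batch_sentences:
        tokens = roberta_model.encode(sentence)
        # same call as in the quoted code: a list of hidden states,
        # one tensor of shape (1, num_tokens, hidden_size) per layer
        all_layers = roberta_model.extract_features(tokens, return_all_hiddens=True)
        # max over the token dimension instead of AvgPool2d -> shape (1, hidden_size)
        sentence_features, _ = all_layers[layer].max(dim=1)
        batch_features.append(sentence_features[0])
    return batch_features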
Is it a bug in the code?