In the paper, section 3.1 BERT, it is said that - we extract a fixed sized vector via max pooling of the second to last layer. A sentence of N words will hence result in an N ∗ H embedding vector. The closer to the last layer, the more the semantic information carried by the weights (Zeiler et al., 2011); hence our choice of the second to last layer.
In section 3.1 Sentence-BERT - We extract fixed size sentence embeddings using a mean over all the output vectors, similar to the method we used for BERT.
But in the code -
def get_features_from_sentence(batch_sentences, layer=-2):
    """
    extracts the BERT semantic representation
    from a sentence, using an averaged value of
    the `layer`-th layer
    returns a 1-dimensional tensor of size 758
    """
    batch_features = []
    for sentence in batch_sentences:
        tokens = roberta_model.encode(sentence)
        all_layers = roberta_model.extract_features(tokens, return_all_hiddens=True)
        pooling = torch.nn.AvgPool2d((len(tokens), 1))
        sentence_features = pooling(all_layers[layer])
        batch_features.append(sentence_features[0])
    return batch_features
there is only average pooling of the second-to-last layer. There is no max pooling of the second-to-last layer.
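For comparison, what the paper describes (max pooling of the second-to-last layer) would look roughly like the sketch below. This is only an illustration written against the same roberta_model calls used in the quoted function; it is not code taken from the repository, and get_max_pooled_features is a hypothetical name:

def get_max_pooled_features(batch_sentences, layer=-2):
    """
    Hypothetical variant following the paper's wording:
    element-wise max pooling over the tokens of the `layer`-th layer
    instead of average pooling.
    """
    batch_features = []
    for sentence in batch_sentences:
        tokens = roberta_model.encode(sentence)
        # same call as in the quoted code: a list of hidden states,
        # one tensor of shape (1, num_tokens, hidden_size) per layer
        all_layers = roberta_model.extract_features(tokens, return_all_hiddens=True)
        # max over the token dimension instead of AvgPool2d -> shape (1, hidden_size)
        sentence_features, _ = all_layers[layer].max(dim=1)
        batch_features.append(sentence_features[0])
    return batch_features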
Is it a bug in the code?