gdamaskinos / unsupervised_topic_segmentation

Not able to compare the research paper with the code #5

Open Akshayextreme opened 2 years ago

Akshayextreme commented 2 years ago

In the paper, section 3.1 (BERT), it is said that: "we extract a fixed-sized vector via max pooling of the second to last layer. A sentence of N words will hence result in an N ∗ H embedding vector. The closer to the last layer, the more the semantic information carried by the weights (Zeiler et al., 2011); hence our choice of the second to last layer."

In section 3.1 (Sentence-BERT): "We extract fixed size sentence embeddings using a mean over all the output vectors, similar to the method we used for BERT."
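
For reference, the two strategies the paper describes can be sketched in a few lines of PyTorch (the `hidden_states` tuple here is a stand-in for the per-layer outputs of a BERT-style encoder; the shapes are assumptions, not the repository's code):

import torch

# stand-in for the 13 hidden layers of a BERT-style encoder,
# each of shape (num_tokens, hidden_size)
hidden_states = tuple(torch.randn(12, 768) for _ in range(13))

# section 3.1, BERT: max pooling over the tokens of the
# second-to-last layer gives one fixed-size sentence vector
bert_embedding = hidden_states[-2].max(dim=0).values  # shape (768,)

# section 3.1, Sentence-BERT: a mean over all the output vectors
sbert_embedding = hidden_states[-1].mean(dim=0)       # shape (768,)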

But in the code -

import torch

# assumes a fairseq RoBERTa model has been loaded, e.g.:
# roberta_model = torch.hub.load('pytorch/fairseq', 'roberta.base')

def get_features_from_sentence(batch_sentences, layer=-2):
    """
    extracts the BERT semantic representation
    from a sentence, using an averaged value of
    the `layer`-th layer

    returns a 1-dimensional tensor of size 768
    """
    batch_features = []
    for sentence in batch_sentences:
        tokens = roberta_model.encode(sentence)
        # all hidden layers, each of shape (1, num_tokens, hidden_size)
        all_layers = roberta_model.extract_features(tokens, return_all_hiddens=True)
        # average over the token dimension of the `layer`-th layer
        pooling = torch.nn.AvgPool2d((len(tokens), 1))
        sentence_features = pooling(all_layers[layer])
        batch_features.append(sentence_features[0])
    return batch_features

there is only average pooling over the second-to-last layer; there is no max pooling of the second-to-last layer as the paper describes.
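
For comparison, a max-pooling variant of the function above, matching the paper's wording, might look like this (a sketch, not code from the repository):

def get_features_from_sentence_maxpool(batch_sentences, layer=-2):
    """
    hypothetical variant: max pooling over the token dimension
    of the `layer`-th layer, as described in the paper
    """
    batch_features = []
    for sentence in batch_sentences:
        tokens = roberta_model.encode(sentence)
        all_layers = roberta_model.extract_features(tokens, return_all_hiddens=True)
        # element-wise max over tokens: (1, num_tokens, H) -> (1, H)
        sentence_features = all_layers[layer].max(dim=1).values
        batch_features.append(sentence_features[0])
    return batch_features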

Is it a bug in the code?

aanchan commented 1 year ago

The max pooling happens here: https://github.com/gdamaskinos/unsupervised_topic_segmentation/blob/cefa8ba65f964220435d35a0cb39722790f9c0b4/core.py#L75
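
For anyone following along, here is a minimal sketch of window-level max pooling over a sequence of sentence embeddings, which is plausibly the kind of operation the linked line performs downstream of `get_features_from_sentence` (the function name, window size, and shapes are illustrative assumptions, not the repository's exact code):

import torch

def max_pool_window(sentence_features, window=3):
    # sentence_features: list of (hidden_size,) tensors, one per sentence
    stacked = torch.stack(sentence_features)  # (num_sentences, hidden_size)
    pooled = []
    for i in range(len(sentence_features)):
        # element-wise max over a trailing window of sentences
        window_slice = stacked[max(0, i - window + 1): i + 1]
        pooled.append(window_slice.max(dim=0).values)
    return pooled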