UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Simple way of producing two independent embeddings #238

Open fjhheras opened 4 years ago

fjhheras commented 4 years ago

I would like to fine-tune BERT (or similar) models for an asymmetric task using two different embeddings. There would be two inputs (1 and 2), and I would use one embedding for 1 and another embedding for 2 to build meaningful distances between 1 and 2. I cannot use a common embedding, because sentences in 1 are of a very different nature from sentences in 2 (it is not exactly like that, but you can think of questions and answers).

I have thought about several options:

Input 1 >> transformer 1 >> Pooling >> Output 1
Input 2 >> transformer 2 >> Pooling >> Output 2
(see the sketch after this list)

Input 1 >> transformer >> Pooling >> Output 1
Input 2 >> transformer >> Pooling >> extra layer >> Output 2

or

Input 1 >> transformer >> Pooling >> Output 1
Input 2 >> transformer >> extra layer >> Pooling >> Output 2
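For reference, the first option would just be two separate pipelines built from the library's modules. A rough sketch (using the generic models.Transformer / models.Pooling classes; the model name is only a placeholder):

    from sentence_transformers import SentenceTransformer, models

    # Option 1: one independent encoder per input type
    transformer_1 = models.Transformer('bert-base-uncased')
    pooling_1 = models.Pooling(transformer_1.get_word_embedding_dimension())
    model_1 = SentenceTransformer(modules=[transformer_1, pooling_1])

    transformer_2 = models.Transformer('bert-base-uncased')
    pooling_2 = models.Pooling(transformer_2.get_word_embedding_dimension())
    model_2 = SentenceTransformer(modules=[transformer_2, pooling_2])

    embeddings_1 = model_1.encode(["a sentence of type 1"])
    embeddings_2 = model_2.encode(["a sentence of type 2"])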

Do you think there is an easy way to do this by adapting one of the training scripts? I would appreciate some guidance on which code in this repo I could try to adapt for my use case, so I can make the most of what is already here!

fjhheras commented 4 years ago

I found some previous answers that were relevant, and even though they do not give all the details, I managed to get something working. I added a final layer to the model that holds several Dense instances; depending on the value of self.condition, one of them is chosen:

    from torch import nn
    from sentence_transformers.models import Dense

    class ConditionalDense(nn.Module):
        # Hypothetical class name; the original snippet only showed the methods
        def __init__(self, in_features, out_features, bias=True,
                     activation_function=nn.Tanh(), conditions=None):
            super().__init__()
            # ... (other initialization elided here)
            self.conditions = conditions
            # One Dense sub-module per condition, e.g. conditions=('1', '2')
            dense_dict = {key: Dense(in_features, out_features, bias=bias,
                                     activation_function=activation_function)
                          for key in self.conditions}
            self.dense_dict = nn.ModuleDict(dense_dict)

        def forward(self, features):
            # self.condition is set from outside before each call to encode()
            return self.dense_dict[self.condition].forward(features)

So when I encode inputs from 1, I first set list(module.children())[-1].condition = '1', and so on. It is not beautiful (monkey patching), but it works. If I wrote a PR to add a layer like this, would you be interested?
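Concretely, the monkey patching looks roughly like this (a sketch; model, sentences_1 and sentences_2 are placeholders for my own setup):

    # `model` is a SentenceTransformer whose last module is the layer above
    conditional_layer = list(model.children())[-1]

    conditional_layer.condition = '1'   # route inputs of type 1
    embeddings_1 = model.encode(sentences_1)

    conditional_layer.condition = '2'   # route inputs of type 2
    embeddings_2 = model.encode(sentences_2)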

I also had to make changes in CosineSimilarityLoss.py and EmbeddingSimilarityEvaluator.py (to set the condition before each call to encode).

nreimers commented 4 years ago

Hi @fjhheras, yes, a nice and clean integration of that would be quite cool.

I think the best way is to integrate the condition information into the dataloader. That information would then be passed to all intermediate modules and could be read from there.

Best Nils

fjhheras commented 4 years ago

Thank you for your answer, @nreimers

How would you send information to all modules?

For example, SentenceTransformer.encode calls self.forward(features). This forward is inherited from nn.Sequential, so it passes all the features to the first module (in the case I am testing, modules/BERT), which does self.bert(**features), where self.bert is a HuggingFace transformer.

If I add the key text_type to the features dictionary, it fails with an error because the HuggingFace transformer does not accept that keyword argument. Even if the last module expects that key, the features never get that far.

fjhheras commented 4 years ago

I can bypass the first module by adding a forward method to SentenceTransformer:

    def forward(self, features, intermediate_features=None):
        # Pass the features through the modules one by one, as nn.Sequential
        # would, but inject the extra features only after the first module
        # (the transformer), so the HuggingFace model never sees them
        for i, module in enumerate(self):
            if i == 1 and intermediate_features is not None:
                features.update(intermediate_features)
            features = module(features)
        return features

I am not sure how general or desirable this would be (and I am still not sure how to do the equivalent during training)...
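For inference it could then be called roughly like this (a sketch; it assumes model.tokenize returns the features dictionary that encode would build, and that a Pooling module produces 'sentence_embedding'):

    # The extra key is injected only after the first (transformer) module,
    # so the HuggingFace model never sees it
    features = model.tokenize(sentences_2)
    output = model.forward(features, intermediate_features={'text_type': '2'})
    embeddings = output['sentence_embedding']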

nreimers commented 4 years ago

Hi @fjhheras, my idea was more to inject a new key into the features dictionary, like: features['condition'] = 1

Then, in the dense layer, you can check features['condition'] and either pass the input through an identity layer or through a non-linear dense layer.
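A minimal sketch of such a module (the class name is hypothetical, and it assumes the 'condition' key has already been added to the features dictionary):

    from torch import nn
    from sentence_transformers.models import Dense

    class ConditionedDense(nn.Module):
        def __init__(self, in_features, out_features):
            super().__init__()
            self.dense = Dense(in_features, out_features,
                               activation_function=nn.Tanh())

        def forward(self, features):
            if features.get('condition') == 1:
                return self.dense(features)   # non-linear dense path
            return features                   # identity path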

But I'm not sure yet how to set up the datareader so that it can add features to the input text which are preserved within the sequential pipeline.

fjhheras commented 4 years ago

Yes, I understood your suggestion. But the first module does not seem to accept an extra key in the features dictionary, at least not in the way it is called in SentenceTransformer.encode.

umairspn commented 4 years ago

I am stuck in the same situation, where I want to train two independent transformer models (one for 1, the other for 2):

Input 1 >> transformer 1 >> Pooling >> Output 1
Input 2 >> transformer 2 >> Pooling >> Output 2

Any help will be appreciated. Thank You!

nreimers commented 4 years ago

Just for completeness: see #328, where I describe an easy method to create two independent embeddings for different inputs without needing any code changes.