Open svjack opened 3 years ago
@svjack Yes, some documentation will follow soon. I still have to figure out the best construction and initialization for asym. models. Until then, an example is in the release notes.
Hi Nils,
I used the Asym model based on the example in the release notes. On my custom task, the EmbeddingSimilarityEvaluator score (cosine similarity, Spearman) went from .7348 to .0136. Have you observed such a drastic performance drop from this switch on any of your tasks? Nothing else changed: same dataset, training regime, etc. I used a custom pretrained xlm-roberta-large with mean pooling and fine-tuned it along with the added layer(s).
dense_model_A = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=output_embed_dim, activation_function=nn.Tanh())
dense_model_B = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=output_embed_dim, activation_function=nn.Tanh())
asym_model = models.Asym({'QRY': [dense_model_A], 'DOC': [dense_model_B]})
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, asym_model])
Thanks.
Hi @LeenaShekhar Training the asym. models can be tricky and I don't have a recommended solution yet.
Both models share the same transformer network in that example. Hence, at the start, if a sentence for 'QRY' and 'DOC' is identical, they are mapped to the same point in vector space.
However, as the dense layers are different, they are then moved to completely different locations in vector space. Backpropagation updates not only the dense layer, but also the transformer layers. So the optimizer fails to find a nice configuration for a shared transformer layer but different dense layers.
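Here are some options to solve it:
- Initialize dense_model_A and dense_model_B with the same weights, so that there is no difference at the beginning. You can initialize them either with the same (random) weights or with a torch.eye() matrix (i.e. the dense layer will not change the embedding at the beginning).
- Instead of a shared transformer layer, use two independent transformer, pooling, and dense layers:
asym_model = models.Asym({'QRY': [word_a, pooling_a, dense_model_A], 'DOC': [word_b, pooling_b, dense_model_B]})
model = SentenceTransformer(modules=[asym_model])
- You can also try to first freeze the shared transformer layer so that the two Dense layers can learn a mapping, then allow the transformer layer to be updated as well.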
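The initialization problem described above can be illustrated with a toy sketch in plain Python. It does not use sentence-transformers: `dense_tanh` is a hypothetical stand-in for a bias-free Dense layer with a Tanh activation, and the dimension (16), the uniform weight range, and the seed are arbitrary assumptions. It only shows that one shared input vector lands at two different points under independently initialized heads, while identically initialized heads agree.

```python
import math
import random


def dense_tanh(weights, vector):
    """Toy dense layer: y = tanh(W x), with W given as a list of rows (no bias)."""
    return [math.tanh(sum(w * x for w, x in zip(row, vector))) for row in weights]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def random_matrix(dim, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]


def identity_matrix(dim):
    return [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]


rng = random.Random(0)
dim = 16
# Stand-in for the pooled output of the shared transformer for one sentence.
shared_embedding = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

# Independent random init: the same sentence is mapped to two different points.
qry_random = dense_tanh(random_matrix(dim, rng), shared_embedding)
doc_random = dense_tanh(random_matrix(dim, rng), shared_embedding)

# Identity init for both heads: the two branches agree at the start of training
# (the tanh still squashes the values, but identically on both sides).
qry_identity = dense_tanh(identity_matrix(dim), shared_embedding)
doc_identity = dense_tanh(identity_matrix(dim), shared_embedding)

sim_random = cosine(qry_random, doc_random)
sim_identity = cosine(qry_identity, doc_identity)
print(f"independent random init: {sim_random:.3f}")
print(f"identity init:           {sim_identity:.3f}")
```

In the real library the same idea would mean giving both Dense modules the same starting weights before training; how best to do that is left open in this thread.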
Thanks Nils. These points make sense; I was thinking along the same lines and am running experiments similar to 1 and 3. For 2, where you suggested using two independent transformers, I made the code change you mentioned but am getting the following error:
  File "/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 524, in fit
    data = next(data_iterator)
  File "/home/string/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/string/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/string/.local/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/ads-nfs-2/string//lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 394, in smart_batching_collate
    tokenized = self.tokenize(texts[idx])
  File "/ads-nfs-2/string//lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 325, in tokenize
    return self._first_module().tokenize(text)
  File "/home/string/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'Asym' object has no attribute 'tokenize'
Let me know if you have an idea why this is happening; I will look at the code in detail too.
@LeenaShekhar Thanks for pointing this out. The Asym model did not have a tokenize method.
I added this method to the model, but it is not yet part of the PyPI release. You can install the package from source.
Thank you so much. I have tested it, and it seems to work now.
Did this work for training in the end? I tried out the functionality for encoding and querying, but I did not manage to train a model based on the example provided. I got the following error:
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'sentence_transformers.readers.InputExample.InputExample'>
Is there any example already available/ any tips to train the models in asym fashion?
Looks like you pass an InputExample at the wrong place. For encoding, you don't have to wrap it into an InputExample, just pass a dict in the format {'doc': 'your text'}
This is not the issue; encoding works very well.
The problem occurs when I try to fit a model.
I have this list of input examples:
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8), InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
Which I then pass into a DataLoader:
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
The error appears when calling the fit function:
model.fit(train_objectives=[(train_dataloader)], evaluator=evaluator, epochs=num_epochs, evaluation_steps=1000, warmup_steps=warmup_steps)
Instead of strings, you have to use dicts in the InputExamples
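A minimal sketch of what "dicts instead of strings" could look like for the training pairs above. The helper `to_asym_texts` and the key names `'QRY'` / `'DOC'` are assumptions for illustration; the keys have to match whatever keys were used when constructing the Asym module. The snippet only builds the data structure in plain Python and does not import sentence-transformers.

```python
def to_asym_texts(query, document, query_key="QRY", doc_key="DOC"):
    """Wrap each text in a single-key dict naming the Asym branch it should go through."""
    return [{query_key: query}, {doc_key: document}]


# The string pairs from the example above, with their similarity labels.
raw_pairs = [
    ("My first sentence", "My second sentence", 0.8),
    ("Another pair", "Unrelated sentence", 0.3),
]

keyed_pairs = [(to_asym_texts(q, d), label) for q, d, label in raw_pairs]

# With sentence-transformers installed, each entry would then be wrapped as
# InputExample(texts=texts, label=label) before being passed to the DataLoader.
for texts, label in keyed_pairs:
    print(texts, label)
```

Separately, `fit` expects each entry of `train_objectives` to be a `(dataloader, loss)` pair, so the loss object needs to be passed alongside the dataloader.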
To get this straight: the model is comparing 'my first sentence' to 'my second sentence'? It learns that the correlation between the two is 80%?
the model is comparing 'my first sentence' to 'my second sentence'?
Yes
It learns that the correlation between the two is 80%?
Not the correlation but the cosine similarity.
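To make the distinction concrete, here is a small plain-Python sketch of what the label is compared against. It assumes a cosine-similarity regression loss (such as the library's CosineSimilarityLoss, which the thread does not name explicitly); the two-dimensional embeddings are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_error(embedding_a, embedding_b, label):
    """Squared difference between the predicted cosine similarity and the gold label."""
    return (cosine_similarity(embedding_a, embedding_b) - label) ** 2


# Made-up embeddings for 'My first sentence' and 'My second sentence'.
emb_first = [1.0, 0.0]
emb_second = [0.8, 0.6]

similarity = cosine_similarity(emb_first, emb_second)
error = squared_error(emb_first, emb_second, label=0.8)
print(f"cosine similarity: {similarity:.3f}, squared error vs. label 0.8: {error:.6f}")
```

Training pushes the embeddings so that this error shrinks, i.e. so that the cosine similarity of the pair approaches the label.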
What if I want to train the model on a large corpus of data w/o giving it a label (as I don't know the ground truth) - with the goal to familiarize it with the type of data I have, how should I approach the problem?
I am trying to use the model to tackle an entity-resolution problem - my goal is to fine-tune the model on a very large dataset, with the goal of making it familiar with the structure of the data (which is a concatenation of numerous text columns) and then, extract embedding vectors from the entire dataset and find those records with the highest cosine similarity score, and group them together as one single entity.
Is that approach feasible with sentence-transformers or am I getting it entirely wrong? My concern is that, because it relies on the semantics, if I concatenate large sentences together, the model will lose its power.
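The grouping step described above could be sketched as follows, assuming the embeddings have already been computed. The vectors, the 0.9 threshold, and the helper `group_by_similarity` are hypothetical, and the pairwise loop is quadratic, so a very large dataset would likely need an approximate nearest-neighbour search instead.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_by_similarity(embeddings, threshold):
    """Union-find grouping: records whose similarity is >= threshold end up in one entity."""
    parent = list(range(len(embeddings)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(embeddings)), 2):
        if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(embeddings)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up embeddings for four records; 0/1 and 2/3 are near-duplicates.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.98, 0.2, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.97, 0.25],
]

groups = group_by_similarity(toy_embeddings, threshold=0.9)
print(groups)
```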
You say "concatenation of numerous text columns". Is this the structure of a database table? Can you point me to a concrete project or paper based on this data structure?
What if I want to train the model on a large corpus of data w/o giving it a label (as I don't know the ground truth) - with the goal to familiarize it with the type of data I have, how should I approach the problem?
Without labels, it will not work well. The best unsupervised approaches are often only on par with pre-trained models.
We will soon release some code that allows training without labels. The improvement depends heavily on the domain.
I hope this does not mean using a cross-encoder to generate labels.
Which unsupervised method will you use? Can you point me to some papers? Thanks
Here are two papers on unsupervised sentence embedding learning: https://openreview.net/forum?id=Ov_sMNau-PF https://arxiv.org/abs/2006.03659
Sparse vector features such as BM25 or TF-IDF vectors do not seem to be usable for search directly. With SVD (or an autoencoder) the vectors carry some meaning but only retain topic features. These unsupervised sparse features seem to perform worse than the BM25 score from a search engine. Is there a method to produce sparse features that can beat the BM25 score? Or is there a project that learns to transform sparse features into dense ones that work like SBERT? Not simply a weighted average of word embeddings, but something that captures sequence information and is trained in a supervised way? Maybe a metric-to-metric learning model (a sparse metric aligned to the dense metric produced by SBERT)? Does this kind of metric-supervised model exist?
BM25 and TF-IDF work very well for search and are quite hard to beat in the general setup across all possible use cases.
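For reference, a minimal plain-Python sketch of the BM25 score discussed here, using the common Okapi formulation with a smoothed IDF. The whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the example documents are assumptions; a search engine's implementation will differ in details.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


documents = [
    "asymmetric models use separate encoders for queries and documents",
    "bm25 is a strong lexical baseline for search",
    "the weather is nice today",
]
scores = bm25_scores("bm25 search baseline", documents)
print(scores)
```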
Hi @LeenaShekhar Training the asym. models can be tricky and I don't have a recommended solution yet.
Both models share the same transformer network in that example. Hence, at the start, if a sentence for 'QRY' and 'DOC' is identical, they are mapped to the same point in vector space.
However, as the dense layers are different, they are then moved to completely different locations in vector space. Backpropagation updates not only the dense layer, but also the transformer layers. So the optimizer fails to find a nice configuration for a shared transformer layer but different dense layers.
Here are some options to solve it:
- Initialize dense_model_A and dense_model_B with the same weights, so that there is no difference at the beginning. You can initialize them either with the same (random) weights or with a torch.eye() matrix (i.e. the dense layer will not change the embedding at the beginning).
- Instead of a shared transformer layer, use two independent transformer, pooling, and dense layers:
asym_model = models.Asym({'QRY': [word_a, pooling_a, dense_model_A], 'DOC': [word_b, pooling_b, dense_model_B]})
model = SentenceTransformer(modules=[asym_model])
- You can also try to first freeze the shared transformer layer so that the two Dense layers can learn a mapping, then allow the transformer layer to be updated as well.
If my understanding is correct, can the asym model class be used to train a DPR model (different query/context encoder) using the MNRL loss (with different transformer backbones and without a dense model after the pooling layer)?
@rc19 Yes. But note that a single model for query / context works better than the 2 models as used in DPR
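For readers unfamiliar with the MNRL objective mentioned in the question, here is a plain-Python sketch of multiple negatives ranking loss with in-batch negatives: each query should score its own document higher than every other document in the batch. The embeddings are made up, and `scale=20.0` is assumed to mirror the library default; whether the query and document embeddings come from one encoder or two does not change the loss itself.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(query_embs, doc_embs, scale=20.0):
    """Mean cross-entropy where document i is the positive for query i
    and all other documents in the batch act as negatives."""
    total = 0.0
    for i, query in enumerate(query_embs):
        logits = [scale * cosine_similarity(query, doc) for doc in doc_embs]
        # Numerically stable log-sum-exp over all documents in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(query_embs)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9]]   # document i is close to query i
swapped_docs = [[0.1, 0.9], [0.9, 0.1]]   # document i is close to the other query

loss_aligned = mnr_loss(queries, aligned_docs)
loss_swapped = mnr_loss(queries, swapped_docs)
print(f"aligned: {loss_aligned:.4f}, swapped: {loss_swapped:.4f}")
```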
Are you aware of any references to existing work on this off the top of your head?
Tried it in this paper & found asym to be worse:
So what I'm understanding from this is that, although the DPR examples in the documentation use separate context and question encoders (facebook-dpr-ctx_encoder-single-nq-base and facebook-dpr-question_encoder-single-nq-base), if we wanted to fine-tune a model for that purpose, we'd be better off using a single model. Maybe a Multi-QA model would be a good base model to start from, using the MNRL loss.
Is this correct?
If that's the case, I'm looking for a guide to help me do that
I think the documentation should have a section discussing different ways to construct Asym models for different datasets and tasks. You could also simply release a paper or some guidelines about the different constructions.