UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

How to add handcrafted features? #1147

Open mylovecc2020 opened 3 years ago

mylovecc2020 commented 3 years ago

Hi, thanks for your great work! Hand-crafted features also work well for categorization tasks in my field, so I want to combine manual features with deep features. However, the loss function, the model, and the fit function of the training process are all encapsulated. Is there an easy way to add manual features directly to the deep features? If there is no corresponding method in the existing framework, I will use a Bi-Encoder to encode the sentences, concatenate the embeddings with the handcrafted features, and then calculate cosine similarity. Is this OK? Or: use a Bi-Encoder to encode the sentences, concatenate with the handcrafted features, and add a fully connected layer for the categorization task. Thanks!
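
Roughly what I have in mind for the first option, as a sketch (the model name and the handcrafted_features function are just placeholders for my setup):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder Bi-Encoder checkpoint

def handcrafted_features(sentence: str) -> np.ndarray:
    # placeholder: replace with the real hand-crafted feature extractor
    return np.array([len(sentence), sentence.count(" ")], dtype=np.float32)

def combined_vector(sentence: str) -> np.ndarray:
    embedding = model.encode(sentence)         # deep features from the Bi-Encoder
    features = handcrafted_features(sentence)  # manual features
    return np.concatenate([embedding, features])

u = combined_vector("first sentence").reshape(1, -1)
v = combined_vector("second sentence").reshape(1, -1)
print(cosine_similarity(u, v)[0, 0])
```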

mylovecc2020 commented 3 years ago

I defined a loss function class, and in it I imitate the CosineSimilarityLoss class to get the sentence embeddings. Once I have the embedding features and the hand-crafted features, I combine them and then compute the loss. Is this right? Do you have a better method? Thanks!
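
Concretely, something like the sketch below, modeled on CosineSimilarityLoss. The class name, the fixed alpha weight, and the trick of packing a precomputed hand-crafted similarity into each example's label (so that it reaches the loss as a second column of the label tensor) are all my own assumptions, not part of the library:

```python
import torch
from torch import nn

class CosineSimilarityWithFeaturesLoss(nn.Module):
    """Sketch: mix the embedding cosine similarity with a precomputed hand-crafted similarity.

    Assumes each InputExample carries label = [gold_score, handcrafted_similarity],
    so the collated labels arrive as a 2-column tensor.
    """

    def __init__(self, model, alpha: float = 0.5):
        super().__init__()
        self.model = model
        self.alpha = alpha            # fixed weight between the two similarities
        self.loss_fct = nn.MSELoss()

    def forward(self, sentence_features, labels):
        # same pattern as CosineSimilarityLoss: encode both sides with the shared model
        embeddings = [self.model(features)["sentence_embedding"] for features in sentence_features]
        embedding_sim = torch.cosine_similarity(embeddings[0], embeddings[1])

        gold_score = labels[:, 0]       # gold similarity label
        handcrafted_sim = labels[:, 1]  # precomputed hand-crafted similarity

        combined = self.alpha * embedding_sim + (1 - self.alpha) * handcrafted_sim
        return self.loss_fct(combined, gold_score)
```

Each training example would then carry something like InputExample(texts=[s1, s2], label=[0.8, 0.4]), where the second value is the precomputed hand-crafted similarity for that pair.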

kddubey commented 3 years ago

I think you'd need to look into at least 3 pretty important questions for a concat-and-compare approach:

  1. are angles between features a useful signal for your data and task? Maybe, e.g., Euclidean distances are more useful.
  2. what's the scale of the features relative to each other and to the embeddings? If they're on the scale of 10-100, then they'll dominate the cosine similarity value.
  3. how many hand-crafted features are there? If there are far fewer than 768 of them and they're on a low scale, they'd be drowned out by the embeddings.

Issue (2), and maybe (1), can be addressed through preprocessing of your hand-crafted features. Issue (3) can be addressed by adding a layer for dimensionality reduction of embeddings (weights tied). A problem is that these fixes seem like they'd require a lot of hyperparameter tuning.
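
To illustrate (2) and (3), a rough sketch; the checkpoint name, the target dimension, and the random feature matrix are placeholders:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from torch import nn
from sentence_transformers import SentenceTransformer, models

# (2) put the hand-crafted features on a comparable scale
raw_handcrafted_feats = np.random.rand(100, 20)  # stand-in for your feature matrix
scaled_feats = StandardScaler().fit_transform(raw_handcrafted_feats)

# (3) add a Dense module so the sentence embedding is projected down before
#     the ~20 hand-crafted dimensions are concatenated to it; in a Bi-Encoder
#     the same modules encode both sentences, so the weights are shared
word_embedding = models.Transformer("distilbert-base-uncased")  # placeholder checkpoint
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
down_project = models.Dense(
    in_features=pooling.get_sentence_embedding_dimension(),
    out_features=64,  # placeholder target dimension
    activation_function=nn.Tanh(),
)
model = SentenceTransformer(modules=[word_embedding, pooling, down_project])
```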

An alternative to concat-and-compare is compare-and-concat: define a similarity metric for hand-crafted features, concatenate it with the cosine similarity from embeddings, and then learn a model from (similarity_features, similarity_embeddings) -> similarity_label.
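
And a sketch of compare-and-concat, where the hand-crafted similarity function, the checkpoint name, and the toy data are placeholders:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint

def handcrafted_similarity(s1: str, s2: str) -> float:
    # placeholder: any similarity defined over your hand-crafted features
    return 1.0 - abs(len(s1) - len(s2)) / max(len(s1), len(s2))

def pair_features(s1: str, s2: str) -> np.ndarray:
    e1, e2 = model.encode([s1, s2])
    cosine_sim = float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
    return np.array([handcrafted_similarity(s1, s2), cosine_sim])

# toy (similarity_features, similarity_embeddings) -> similarity_label data
sentence_pairs = [("a short sentence", "another short sentence"),
                  ("a completely different topic", "a short sentence")]
labels = [1, 0]

X = np.stack([pair_features(s1, s2) for s1, s2 in sentence_pairs])
clf = LogisticRegression().fit(X, labels)
print(clf.predict_proba(X)[:, 1])
```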

nreimers commented 3 years ago

When you have a limited number of binary features, you can add special tokens to the tokenizer and prepend them to the text input.

So your sentence becomes e.g. "[FEAT1] [FEAT3] My first example" and another example becomes "[FEAT2] [FEAT3] [FEAT4] Another input text"

Only works when you have binary features and when an example does not have too many features.
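
A possible way to wire this up in sentence-transformers (a sketch; the token names and checkpoint are placeholders, and it assumes the first module of the model is the Transformer):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint
transformer = model._first_module()              # assumes this is the models.Transformer module

# register the feature tokens so the tokenizer keeps them as single tokens
feature_tokens = ["[FEAT1]", "[FEAT2]", "[FEAT3]", "[FEAT4]"]
transformer.tokenizer.add_tokens(feature_tokens, special_tokens=True)
transformer.auto_model.resize_token_embeddings(len(transformer.tokenizer))

def prepend_feature_tokens(text: str, active_features: list[str]) -> str:
    # one special token per "on" binary feature, prepended to the input text
    return " ".join(active_features + [text])

print(prepend_feature_tokens("My first example", ["[FEAT1]", "[FEAT3]"]))
# -> "[FEAT1] [FEAT3] My first example"
```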

kddubey commented 3 years ago

Hi @nreimers

Can you expand on this method a bit more? I'm just a bit surprised that incorporating binary features relevant to sentence similarity is really as simple as prepending a special token for each "on" binary feature. Two questions I have:

  1. Do you have an idea of how this method would perform against training via BatchHardTripletLoss?
  2. Do I have to worry about the order of the special tokens? I assume not, based on this paper: https://arxiv.org/abs/2012.15180

nreimers commented 3 years ago

Hi @kddubey

  1. I don't see any connection to training with BatchHardTripletLoss. BatchHardTripletLoss is a loss function; the other is a way to extend your text with additional input features.
  2. I think it would be good either to ensure that the special tokens are always in the same order, or to shuffle the order during training.

kddubey commented 3 years ago

  1. Ah sorry, I spoke too soon there. I was thinking about how, for my particular dataset, the binary features may not overlap much. If so (I still need to check), each "on" feature could be treated as its own category. Then I'd label each sentence with one category and train via BatchHardTripletLoss, as in the sketch below.
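
The kind of setup I'm picturing, with toy sentences and a placeholder checkpoint:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint

# treat each "on" binary feature as its own category label
train_examples = [
    InputExample(texts=["first sentence where FEAT1 is on"], label=0),
    InputExample(texts=["another FEAT1 sentence"], label=0),
    InputExample(texts=["a sentence where FEAT2 is on"], label=1),
    InputExample(texts=["one more FEAT2 sentence"], label=1),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=4)
train_loss = losses.BatchHardTripletLoss(model=model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```
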
mylovecc2020 commented 3 years ago

> I think you'd need to look into at least 3 pretty important questions for a concat-and-compare approach:
>
>   1. are angles between features a useful signal for your data and task? Maybe, e.g., Euclidean distances are more useful.
>   2. what's the scale of the features relative to each other and to the embeddings? If they're on the scale of 10-100, then they'll dominate the cosine similarity value.
>   3. how many hand-crafted features are there? If there are far fewer than 768 of them and they're on a low scale, they'd be drowned out by the embeddings.
>
> Issue (2), and maybe (1), can be addressed through preprocessing of your hand-crafted features. Issue (3) can be addressed by adding a layer for dimensionality reduction of embeddings (weights tied). A problem is that these fixes seem like they'd require a lot of hyperparameter tuning.
>
> An alternative to concat-and-compare is compare-and-concat: define a similarity metric for hand-crafted features, concatenate it with the cosine similarity from embeddings, and then learn a model from (similarity_features, similarity_embeddings) -> similarity_label.

  1. Yes, the handcrafted features are statistical features comparing the differences between two sentences, which cannot be measured by an angle. Generally, such statistical features are fed directly into a classifier similar to logistic regression.
  2. I tried adding softmax and normalization as preprocessing and then feeding the result into a fully connected classifier, but this did not work well; the accuracy was about 0.5. The classification features have fewer than 20 dimensions. In the Bi-Encoder, we considered reducing the embedding dimension to 20 and concatenating it with the manual features, but the result was also not very good, only about 0.7, which I estimate is due to the reduction from 768 to 20 dimensions. Finally, I computed the manual features and the deep features separately and adjusted the weight of the two parts with an automatically learned balance factor, alpha (see the sketch below). The result was quite good, and observing alpha is interpretable to some extent.
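
Roughly what the alpha part looks like, as a simplified sketch (the class name, the two scoring heads, and the dimensions are placeholders for my actual setup, not anything from the library):

```python
import torch
from torch import nn

class AlphaFusionScorer(nn.Module):
    """Sketch: blend a deep-feature score and a manual-feature score with a learned alpha."""

    def __init__(self, emb_dim: int, feat_dim: int):
        super().__init__()
        self.deep_head = nn.Linear(emb_dim, 1)          # score from sentence-pair deep features
        self.manual_head = nn.Linear(feat_dim, 1)       # score from hand-crafted features
        self._alpha = nn.Parameter(torch.tensor(0.0))   # unconstrained, squashed to (0, 1) below

    @property
    def alpha(self) -> torch.Tensor:
        return torch.sigmoid(self._alpha)

    def forward(self, deep_feats: torch.Tensor, manual_feats: torch.Tensor) -> torch.Tensor:
        a = self.alpha
        score = a * self.deep_head(deep_feats) + (1 - a) * self.manual_head(manual_feats)
        return torch.sigmoid(score).squeeze(-1)

scorer = AlphaFusionScorer(emb_dim=768, feat_dim=20)
deep = torch.randn(8, 768)   # e.g. |u - v| from the Bi-Encoder embeddings of a pair
manual = torch.randn(8, 20)  # the ~20 hand-crafted statistical features
print(scorer(deep, manual).shape, float(scorer.alpha))
```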

Thanks!

mylovecc2020 commented 3 years ago

> When you have a limited number of binary features, you can add special tokens to the tokenizer and prepend them to the text input.
>
> So your sentence becomes e.g. "[FEAT1] [FEAT3] My first example" and another example becomes "[FEAT2] [FEAT3] [FEAT4] Another input text"
>
> Only works when you have binary features and when an example does not have too many features.

This approach sounds crazy to me, but it's actually much easier to implement, because you can inject the features directly into the input string. But why binary features? Is there any theoretical basis? I'll try it right away. Thanks a lot!

nreimers commented 3 years ago

Hi @mylovecc2020 Binary features: because your text either has the special token (like [FEAT3]) or it doesn't. So if you want to classify text in an online community and want to differentiate between guests and users, you would change your text to "[GUEST] This is a post" or "[USER] This is a post".

For continuous features this sadly doesn't work, as there is an infinite number of values. But you can use binning there: e.g., when you have a feature like "how long the user has been registered, in days", you can create bins like "[0-10days]", "[10-100days]", and "[100+days]", and then add these 3 special tokens to your input text.
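
A minimal sketch of that binning, using the example bins above (the 3 bin tokens would also need to be added to the tokenizer, just like the binary-feature tokens):

```python
def days_registered_token(days: int) -> str:
    # map a continuous feature to one of a few special tokens via binning
    if days < 10:
        return "[0-10days]"
    elif days < 100:
        return "[10-100days]"
    return "[100+days]"

def with_days_token(text: str, days_registered: int) -> str:
    # prepend the bin token to the input text
    return f"{days_registered_token(days_registered)} {text}"

print(with_days_token("This is a post", days_registered=42))
# -> "[10-100days] This is a post"
```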