UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Has someone tried adding an Auto Encoder on top of S-BERT? #183

Open praateekmahajan opened 4 years ago

praateekmahajan commented 4 years ago

Hi all, first of all, great work to the authors on the paper and the repo :)

This repo might or might not be the best place to post this, but I was wondering whether someone has tried adding an autoencoder on top of the output of S-BERT?

Something like a VAE?

The intuition is that putting an AE on top of S-BERT forces the embeddings through a lower-dimensional bottleneck. With an MSE reconstruction loss, the reduced-dimension vector hopefully captures most of the original information.

The benefits, to name a few, are:

  1. Reduced dimensionality means that vector search, and any other downstream compute, would be faster.
  2. Something like a VAE could even open up new areas of research, where we might find that some latent variables capture, say, humour while others capture political content.

And hopefully a VAE can generalise to unseen data.
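
To make the idea concrete, here is a rough sketch of what I have in mind (just an illustration, not a finished recipe: the model name, bottleneck size, toy sentences, and training loop are placeholders, and the AE sits on top of frozen S-BERT embeddings with a plain MSE reconstruction loss):

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

# Encode a corpus with a frozen S-BERT model (model name is just an example).
sbert = SentenceTransformer("all-mpnet-base-v2")
sentences = ["A man is eating food.", "A man is eating a piece of bread.",
             "The girl is carrying a baby.", "A cheetah is running behind its prey."]
embeddings = torch.tensor(sbert.encode(sentences))  # shape: (n, 768)

class AutoEncoder(nn.Module):
    """Compress 768-d S-BERT vectors to a smaller latent space and reconstruct them."""
    def __init__(self, dim=768, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    reconstruction, latent = model(embeddings)
    loss = loss_fn(reconstruction, embeddings)  # MSE reconstruction loss
    loss.backward()
    optimizer.step()

# `latent` now holds the reduced-dimension vectors for downstream search.
```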

I tried PCA on an in-house dataset and saw SSEs of 525 and 100 respectively on a test set of 100 examples, which might not be that bad given that each of the 100 examples has 768 dimensions.
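
Roughly, the PCA baseline looks like this (a sketch only; the random array is a stand-in for the real embeddings, and the reduced dimension is just an example):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder: in practice these would be S-BERT vectors, shape (100, 768).
test_embeddings = np.random.randn(100, 768).astype(np.float32)

# Reduced dimension is an example; n_components can be at most min(n_samples, n_features).
pca = PCA(n_components=64)
reduced = pca.fit_transform(test_embeddings)
reconstructed = pca.inverse_transform(reduced)

# Sum of squared errors over the whole test set.
sse = np.sum((test_embeddings - reconstructed) ** 2)
print(f"SSE over {len(test_embeddings)} examples: {sse:.1f}")
```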

The results are somewhat motivating, and I was wondering whether this topic interests anyone, or whether someone knows of active or past research in this area.

I would be happy to contribute in whatever way possible.

nreimers commented 4 years ago

Sounds like an interesting idea. I think it would be interesting to see if the original properties are kept, for example, that similar sentences are close in vector space.
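
A quick way to check that (just a sketch: the model name is an example, and PCA stands in for the autoencoder bottleneck) is to compare the pairwise cosine similarities before and after the reduction:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # example model
sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A cheetah is running behind its prey.",
    "Someone is playing the guitar.",
    "A woman is slicing an onion.",
]
emb = model.encode(sentences)

# Stand-in for the autoencoder bottleneck: PCA to 4 dims
# (n_components can be at most the number of samples in this toy example).
reduced = PCA(n_components=4).fit_transform(emb)

sim_full = util.cos_sim(emb, emb).numpy()
sim_reduced = util.cos_sim(reduced, reduced).numpy()

# Correlation of the pairwise similarities: a high value means the
# "similar sentences stay close" property is largely preserved.
iu = np.triu_indices(len(sentences), k=1)
rho, _ = spearmanr(sim_full[iu], sim_reduced[iu])
print("Spearman correlation of pairwise similarities:", rho)
```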

ExeCuteRunrunrun commented 4 years ago

@praateekmahajan I was just thinking about VAEs and came across your post here! Please do let me know if you have further ideas. I am using S-BERT to encode questions and articles and then calculate the similarities to rank the closest articles. I compared question-title, question-paragraph and question-article pairs, and found that the question-paragraph pairs give the most interesting results. But the drawbacks are that

  1. both encoding a question and loading the paragraph embeddings take a long time, which makes real-time use impossible (a caching sketch is at the end of this comment);
  2. if I give a short question, the model tends to choose short paragraphs and miss the longer but more interesting ones.

So naturally I'm thinking about a VAE to

  1. reduce the embedding dimensions, as you said, and
  2. summarise the question and the paragraphs in an abstractive way, so that both are represented in the latent space and longer paragraphs could eventually be considered even for short questions.

I might be imagining too much for the second point, but I would really be happy to hear your advice!
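
For the speed issue in point 1, this is a minimal sketch of precomputing and caching the paragraph embeddings so that only the question has to be encoded at query time (the model name, file path, and toy paragraphs are placeholders):

```python
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")  # example model name

# Encode the paragraphs once, offline, and cache them (path is a placeholder).
paragraphs = ["First paragraph ...", "Second paragraph ...", "Third paragraph ..."]
corpus_embeddings = model.encode(paragraphs, convert_to_tensor=True)
torch.save(corpus_embeddings, "paragraph_embeddings.pt")

# At query time, only the question needs to be encoded.
corpus_embeddings = torch.load("paragraph_embeddings.pt")
question_embedding = model.encode("short question ?", convert_to_tensor=True)

# Cosine-similarity search over the cached paragraph embeddings.
hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=3)[0]
for hit in hits:
    print(paragraphs[hit["corpus_id"]], hit["score"])
```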

Bk073 commented 1 year ago

Has anyone worked on this, i.e. an autoencoder on top of pre-trained BERT?