UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0
14.52k stars · 2.41k forks

Issue in using LSTM with BERT #202

Open AnandIyer98 opened 4 years ago

AnandIyer98 commented 4 years ago

After the most recent update I'm getting an error while using LSTM with BERT. Can someone please help me resolve this? (screenshot of the error attached)

nreimers commented 4 years ago

I didn't think that anyone would use an LSTM in combination with BERT.

The sentence-length feature was removed from the BERT feature extraction, so it is currently not compatible with the LSTM module.

I will update the code soon and fix it.
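For reference, a minimal sketch of the kind of setup that triggers this error, i.e. stacking the LSTM module on top of BERT in sentence-transformers. Module names follow the `sentence_transformers.models` API and may differ between versions; whether it runs depends on the version, since the LSTM module expects a sentence-length feature that the BERT feature extraction may not provide, which is exactly the incompatibility described above.

```python
from sentence_transformers import SentenceTransformer, models

# Contextualized token embeddings from a BERT checkpoint
word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=128)

# Bidirectional LSTM over the BERT token embeddings
# (this module needs the sentence-length information mentioned above)
lstm = models.LSTM(
    word_embedding_dimension=word_embedding_model.get_word_embedding_dimension(),
    hidden_dim=384,
)

# Mean pooling over the LSTM outputs gives a fixed-size sentence embedding
pooling = models.Pooling(
    lstm.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)

model = SentenceTransformer(modules=[word_embedding_model, lstm, pooling])
embeddings = model.encode(["This is an example sentence."])
print(embeddings.shape)
```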

AnandIyer98 commented 4 years ago

Thanks a lot. Though I wanted to ask: wouldn't BERT combined with an LSTM outperform SBERT? What is your opinion on this?

nreimers commented 4 years ago

I think it will not lead to an improvement, as BERT already applies attention multiple times over all inputs

AnandIyer98 commented 4 years ago

But wouldn't an LSTM after BERT help better capture the context of the sentence, which in turn would increase the performance on STS tasks? Another question: how would a deep averaging network (DAN) or a CNN on top of BERT perform?

Thanks a lot for the clarification.

nreimers commented 4 years ago

Hi @AnandIyer98 It is sadly not that easy. A more powerful architecture does not automatically lead to better sentence embeddings. When sentence embeddings are compared with cosine similarity, all dimensions are treated equally and with the same weight. If some dimensions are not well aligned, e.g. contain garbage or have a vastly different scale, they can severely distort the sentence embeddings.
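A toy illustration of that scale problem (the numbers are made up, not taken from any real model): if a single dimension carries garbage on a much larger scale than the rest, it dominates the cosine similarity no matter how well the remaining dimensions agree.

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two "sentence embeddings" that agree perfectly on the first three dimensions
a = np.array([0.5, 0.5, 0.5, 0.0])
b = np.array([0.5, 0.5, 0.5, 0.0])
print(cos(a, b))  # 1.0 -- identical

# Same vectors, but the last dimension contains garbage on a much larger scale
a_noisy = np.array([0.5, 0.5, 0.5,  20.0])
b_noisy = np.array([0.5, 0.5, 0.5, -20.0])
print(cos(a_noisy, b_noisy))  # about -0.996 -- the single bad dimension dominates
```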

Further, BERT + LSTM is not a more powerful network. An LSTM has the issue that it can only see the previous text, not the following text. If you use a bidirectional LSTM, one direction can only see the previous text and the other only the following text in the sentence. Further, LSTMs have trouble modeling long-range dependencies in text: if there is an important relation between word 5 and word 25, that information must be carried through 20 LSTM steps, and it is easy for it to be lost along the way.

BERT's attention mechanism, in contrast, can capture the full context for every word, i.e., for each word it can see all previous words and all following words. Further, distance no longer matters: it makes no difference whether the two important words are 1 step or 20 steps apart.
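A rough way to see this difference (a toy sketch of scaled dot-product attention, not the actual BERT code): every token attends to every other token in a single step, so the weight connecting token 5 and token 25 is computed directly from their representations, independent of how far apart they are.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """One attention step: every position looks at every other position directly."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (seq_len, seq_len) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

seq_len, d = 30, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d))  # stand-in for token representations

_, weights = scaled_dot_product_attention(X, X, X)
# The weight between token 5 and token 25 is computed in one step,
# not carried through 20 recurrent updates as in an LSTM.
print(weights[5, 25])
```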

Adding a DAN or a CNN on top of BERT will also not make the model more powerful.

AnandIyer98 commented 4 years ago

Thanks a lot @nreimers This was really helpful.