UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Using sentence transformers for transforming words with word-windows? #31

Open Hellisotherpeople opened 5 years ago

Hellisotherpeople commented 5 years ago

I've written an Extractive Summarizer called CX_DB8 which utilizes pretrained word-embedding models to summarize/semantically-search documents. It works at the word, sentence or paragraph level, and supports any pretrained model available with pytorch-transformers or offered via the Flair AI package.

My question is this: Is sentence-transformers suitable for training / fine-tuning with, say, 10-word sliding word-windows? What about paragraph-sized texts? Are the pretrained models offered here suitable for running word-windows through without any fine-tuning? And what do you think about using these sentence/word-window embeddings with the PageRank algorithm?
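To make the idea concrete, here is a rough sketch of the pipeline I have in mind; the pretrained model name, window size, stride, and summary length are arbitrary placeholders, and networkx's PageRank simply stands in for the ranking step:

```python
# Sketch: embed overlapping word-windows with a pretrained sentence-transformers
# model, build a cosine-similarity graph over the windows, rank them with PageRank,
# and keep the top-ranked windows as an extractive summary.
import networkx as nx
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('bert-base-nli-mean-tokens')  # placeholder pretrained model

def word_windows(text, size=10, stride=5):
    """Split a document into overlapping word-windows of `size` words."""
    words = text.split()
    return [' '.join(words[i:i + size])
            for i in range(0, max(len(words) - size + 1, 1), stride)]

document = "This is placeholder text standing in for a real document. " * 30
windows = word_windows(document)
embeddings = model.encode(windows)          # one vector per word-window

# Similarity graph: nodes are windows, edge weights are cosine similarities
sim = cosine_similarity(embeddings)
graph = nx.from_numpy_array(sim)
scores = nx.pagerank(graph, weight='weight')

# Highest-ranked windows form the extractive summary
top = sorted(scores, key=scores.get, reverse=True)[:5]
print([windows[i] for i in top])
```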

nreimers commented 5 years ago

Hi, the pre-trained models here were trained on single (complete) sentences. I don't expect them to work well out-of-the-box for sliding windows or paragraphs (the BERT models at least).

Training with sliding windows / paragraph-sized texts is possible. Note that the BERT model's run time and memory requirements grow quadratically with the number of tokens, so for longer inputs training quickly becomes too expensive in both time and memory. Also, BERT is limited to 512 tokens, which might be too small for paragraph-sized texts.
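For illustration, a minimal sketch of where that limit bites, using the library's current module API (models.Transformer; older versions used models.BERT). The model name and max_seq_length value here are just examples; anything beyond max_seq_length word-pieces is truncated before encoding:

```python
# BERT-style models cap input at 512 word-piece tokens, and self-attention cost
# grows quadratically with the sequence length, so paragraph-sized inputs are
# both slow and silently truncated. max_seq_length below is an example value.
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Anything beyond max_seq_length word-pieces is cut off before encoding
paragraph = " ".join(["token"] * 1000)
embedding = model.encode(paragraph)
print(embedding.shape)  # (768,) regardless of how long the paragraph was
```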

But you can use the average-word-embeddings methods, possibly combined with a CNN and/or a DAN network. They also produce quite good sentence embeddings, with performance nearly on par with BERT at a fraction of the cost.
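A rough sketch of such a model with the library's module API, along the lines of the repository's training examples; the GloVe file path and layer sizes are placeholders:

```python
# Average word embeddings with an optional CNN on top and a small DAN-style
# dense head. No transformer involved, so cost scales mildly with input length.
from torch import nn
from sentence_transformers import SentenceTransformer, models

# Map tokens to pretrained GloVe vectors (file must be downloaded beforehand)
word_embedding_model = models.WordEmbeddings.from_text_file('glove.6B.300d.txt.gz')

# Optional CNN over the token embeddings
cnn = models.CNN(in_word_embedding_dimension=word_embedding_model.get_word_embedding_dimension(),
                 out_channels=256, kernel_sizes=[1, 3, 5])

# Mean-pooling over tokens, then a dense layer (DAN-style head)
pooling_model = models.Pooling(cnn.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)
dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(),
                           out_features=256, activation_function=nn.Tanh())

model = SentenceTransformer(modules=[word_embedding_model, cnn, pooling_model, dense_model])
embeddings = model.encode(["a ten word window of text goes in here today"])
```

Such a model can then be fine-tuned on your own word-window or paragraph pairs with the usual training loop, just like the transformer-based models.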

Good luck with your experiments. Let me know if you have further questions.