Hellisotherpeople opened this issue 5 years ago
Hi, the pre-trained models offered here were trained on single (complete) sentences. I don't expect them to work well out of the box for sliding windows or paragraphs (at least not the BERT-based models).
Training with sliding windows or paragraph-sized texts is possible. Note that BERT's run time and memory requirements grow quadratically with the number of tokens, so training quickly becomes too slow and memory-hungry. BERT is also limited to 512 tokens, which might be too short for paragraph-sized texts.
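If you want to try it anyway, a minimal sketch of building a model with a longer input window looks roughly like this (module names follow the current example scripts and may differ slightly across versions; the 256-token limit and model name are just illustrative choices):

```python
from sentence_transformers import SentenceTransformer, models

# BERT encoder with a longer input window; 512 tokens is the hard upper limit,
# and run time / memory grow quadratically with this value.
word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=256)

# Mean pooling over the token embeddings gives one fixed-size vector per text.
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Paragraph-sized inputs are truncated to max_seq_length tokens.
embeddings = model.encode(["A paragraph-sized text ...", "A ten-word sliding window ..."])
```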
But you can use the average word embeddings method, possibly combined with a CNN and/or a DAN network. These also produce quite good sentence embeddings, with performance nearly on par with BERT at a fraction of the cost.
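A rough sketch of such an averaged-word-embedding model with this library, mirroring the GloVe example scripts (the GloVe file name and the dense-layer size are placeholders):

```python
from torch import nn
from sentence_transformers import SentenceTransformer, models

# Static word embeddings (e.g. GloVe); cost is linear in the number of tokens.
word_embedding_model = models.WordEmbeddings.from_text_file('glove.6B.300d.txt.gz')

# Mean pooling over the word vectors gives the averaged sentence embedding.
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)

# One or more dense layers on top turn this into a small DAN-style network.
dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(),
                           out_features=300,
                           activation_function=nn.Tanh())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
```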
Good luck with your experiments. Let me know if you have further questions.
I've written an extractive summarizer called CX_DB8 which uses pretrained word-embedding models to summarize and semantically search documents. It works at the word, sentence, or paragraph level, and supports any pretrained model available through pytorch-transformers or the Flair package.
My question is this: is "sentence-transformers" suitable for training / fine-tuning on, say, 10-word sliding windows? What about paragraph-sized texts? Are the pretrained models offered here suitable for running word-windows through without any fine-tuning? And what do you think about combining these sentence/word-window embeddings with the PageRank algorithm?
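For context, what I have in mind is roughly the following TextRank-style sketch (the model name, similarity measure, and top-k cutoff are placeholder assumptions on my part):

```python
import networkx as nx
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Text units could be sentences, paragraphs, or sliding word-windows.
units = ["first text unit ...", "second text unit ...", "third text unit ..."]

# Placeholder model name; any pretrained sentence-transformers model would do.
model = SentenceTransformer('bert-base-nli-mean-tokens')
embeddings = model.encode(units)

# Fully connected graph weighted by cosine similarity between units.
similarity = cosine_similarity(np.asarray(embeddings))
np.fill_diagonal(similarity, 0.0)
graph = nx.from_numpy_array(similarity)

# PageRank scores each unit by its centrality; keep the top-scoring units as the summary.
scores = nx.pagerank(graph, weight='weight')
top_k = 2
summary = [units[i] for i in sorted(scores, key=scores.get, reverse=True)[:top_k]]
print(summary)
```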