MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License
3.31k stars, 337 forks

Question: What is the ideal doc size to use? #140

Closed by RumiAllbert 1 year ago

RumiAllbert commented 1 year ago

Hi, is there a limit to the doc size that should be used? Or are there any recommendations? Will it be okay to use both long and short documents?

MaartenGr commented 1 year ago

Is there a limit to the doc size that should be used? Or are there any recommendations?

It mostly depends on the embedding model that you use. With SentenceTransformers, shorter documents at the sentence or paragraph level are preferred. If you use something like Longformer, longer documents might also work. In practice, if the documents are exceedingly long, I would typically advise splitting them into paragraphs or sentences.
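As a rough illustration of the splitting advice above, here is a minimal sketch (not from the thread). The helper names `split_into_chunks` and `keywords_per_chunk`, the blank-line paragraph rule, and the 200-word cap are all assumptions made for the example; the only KeyBERT call it relies on is `extract_keywords(doc, top_n=...)`. A sensible cap depends on the embedding model you pick.

```python
import re


def split_into_chunks(text, max_words=200):
    """Split text on blank lines, then cap each piece at max_words words.

    Hypothetical helper: the 200-word default is an arbitrary illustration,
    not a limit taken from KeyBERT or any particular embedding model.
    """
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):
        words = paragraph.split()
        # Break oversized paragraphs into consecutive windows of max_words.
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks


def keywords_per_chunk(text, kw_model, max_words=200, top_n=5):
    """Run keyword extraction on each chunk; return (chunk, keywords) pairs.

    kw_model is expected to be a KeyBERT instance (or any object with a
    compatible extract_keywords(doc, top_n=...) method).
    """
    results = []
    for chunk in split_into_chunks(text, max_words=max_words):
        results.append((chunk, kw_model.extract_keywords(chunk, top_n=top_n)))
    return results


# Usage (requires `pip install keybert`; downloads a model on first run):
#
#   from keybert import KeyBERT
#   kw_model = KeyBERT()
#   for chunk, keywords in keywords_per_chunk(long_document, kw_model):
#       print(keywords)
```

Extracting keywords per chunk keeps each input short enough for a sentence-level model. How you merge the per-chunk keywords afterwards (for example, keeping the highest-scoring ones overall) is left open here.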

Will it be okay to use both long and short documents?

If your model can capture both long and short documents, then this should be no problem!

RumiAllbert commented 1 year ago

Maarten, thank you for the speedy response. I will take what you have said into account. Btw, do let me know about the topic distribution approximation whenever you finish it ;)