Closed RumiAllbert closed 1 year ago
Is there a limit to the document size that should be used? Or are there any recommendations?
It mostly depends on the embedding model that you use. With SentenceTransformers, smaller documents at the sentence or paragraph level are preferred. If you use something like Longformer, longer documents might also work. In practice, if the documents are exceedingly long, I would typically advise splitting them up into paragraphs or sentences.
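For anyone wanting to try the splitting advice above, here is a minimal sketch using only the standard library. The function names, the `max_words` threshold, and the naive regex sentence splitter are all assumptions made for illustration; they are not part of BERTopic, and a proper sentence tokenizer would likely do better on real text.

```python
import re


def split_into_paragraphs(text):
    """Split on blank lines and drop empty pieces."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


def split_into_sentences(text):
    """Naive split: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in parts if s]


def chunk_documents(docs, max_words=100):
    """Return paragraph-level chunks.

    Paragraphs with more than max_words whitespace-separated words
    (a hypothetical threshold) are split further into sentences.
    """
    chunks = []
    for doc in docs:
        for para in split_into_paragraphs(doc):
            if len(para.split()) <= max_words:
                chunks.append(para)
            else:
                chunks.extend(split_into_sentences(para))
    return chunks
```

The resulting list of chunks could then be passed in place of the original documents, for example to `BERTopic().fit_transform(chunks)`. Keep in mind that topics are then assigned per chunk rather than per original document.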
Will it be okay to use both long and short documents?
If your model can capture both long and short documents, then this should be no problem!
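One rough way to check whether a mixed collection is likely to be a problem is to flag the documents that exceed some word budget. This is only a sketch: `flag_long_documents` and the `max_words` value are hypothetical, and a whitespace word count is a crude proxy for the token count an embedding model actually sees. Many sentence-embedding models truncate input past a fixed token limit, so the flagged documents are candidates for splitting.

```python
def flag_long_documents(docs, max_words=200):
    """Return indices of documents whose whitespace word count exceeds max_words.

    max_words is an assumed budget; pick it based on the embedding model in use.
    """
    return [i for i, doc in enumerate(docs) if len(doc.split()) > max_words]
```

If nothing is flagged, mixing long and short documents is probably fine for that model; otherwise the flagged ones can be split before fitting.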
Maarten, thank you for the speedy response. I will take what you have said into account. Btw do let me know about topic distribution approximation whenever you get it finished ;)